kubernetes / git-sync

A sidecar app which clones a git repo and keeps it in sync with the upstream.
Apache License 2.0

Sync AWS CodeCommit + SparkApplication #808

Closed lucasmsmedeiros closed 9 months ago

lucasmsmedeiros commented 9 months ago

Hello everyone!

I'm running spark-operator on k8s and need to sync my AWS CodeCommit repository directly, so that I can import my Python modules without having to build them into the images. I've already used git-sync with GitHub by deploying SSH credentials to the namespace. Now I am trying to sync using AWS credentials, with the YAML below:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: teste-sync-{{ macros.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") }}
  namespace: processing
spec:
  sparkConf:
    extraJavaOptions: -Dcom.amazonaws.services.s3.enableV4=true
    spark.jars.packages: "org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-avro_2.12:3.0.1"
    spark.driver.extraJavaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp"
    spark.kubernetes.allocation.batch.size: "10"
    spark.sql.debug.maxToStringFields: "2000"
  hadoopConf:
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "fs.s3a.path.style.access": "True"
    "fs.s3a.connection.ssl.enabled": "True"
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: url_spark_image
  imagePullPolicy: Always
  mainApplicationFile: teste-sync.py
  sparkVersion: "3.1.2"
  restartPolicy:
    type: Never
  volumes:
    - name: ivy
      emptyDir: {}
    - name: scripts
      emptyDir: {}
  driver:
    volumeMounts:
      - name: scripts
        mountPath: /git-sync
    initContainers:
      - name: git-sync
        image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
        imagePullPolicy: IfNotPresent
        volumeMounts:
          - name: scripts
            mountPath: /scripts
        env:
          - name: GIT_SYNC_REPO
            value: "https://git-codecommit.MY_REGION.amazonaws.com/v1/repos/MY_REPO"
          - name: GIT_SYNC_BRANCH
            value: "master"   
          - name: GIT_SYNC_ROOT
            value: /dags
          - name: GIT_SYNC_DEST
            value: "main"
          - name: GIT_SYNC_ONE_TIME
            value: "true"
          - name: GIT_SYNC_SSH
            value: "false"
          - name: GIT_SYNC_AUTH
            value: "basic"   
          - name: AWS_ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_access_key_id
          - name: AWS_SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: aws-credentials
                key: aws_secret_access_key           
    env:
      - name: PYTHONPATH
        value: "$PYTHONPATH:/git-sync/main/scripts"              
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: 3.1.2
    serviceAccount: spark
    volumeMounts:
      - name: ivy
        mountPath: /tmp
  executor:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: aws_access_key_id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: aws_secret_access_key
    cores: 1
    instances: 2
    memory: "3g"
    labels:
      version: 3.1.2
    volumeMounts:
      - name: ivy
        mountPath: /tmp

In my tests this is not working. Can anyone help me? Is there a problem with the YAML, or does this type of authentication simply not work, so that I will have to deploy SSH instead?

thockin commented 9 months ago

A few things:

1) GIT_SYNC_AUTH is not a thing

2) since your REPO is "https://" I assume you want to pass GIT_SYNC_USERNAME and either GIT_SYNC_PASSWORD or GIT_SYNC_PASSWORD_FILE. It looks like you have those but in variable names that git-sync would have no way to know about.

3) If you look at logs I bet you will see something indicating an auth failure.

thockin commented 9 months ago

I'm going to close this for now. Let me know if you still can't make it work. The logs will show you the flags it used. If the username and password are not known, it can't pass them to basicauth.

lucasmsmedeiros commented 9 months ago

Hi, @thockin!

I made the changes you proposed and I still can't make it work...

My yaml now:

apiVersion: v1
kind: Pod
metadata:
  name: "{{APP_NAME}}"
  namespace: orchestrator
spec:
  containers:
    - name: python-container
      image: "{{PYTHON_IMAGE}}"
      imagePullPolicy: IfNotPresent
      securityContext:
        allowPrivilegeEscalation: false
        runAsUser: 0
      command:
        - "python"
        - "/opt/app/{{API_FILE_PATH}}"
      volumeMounts:
        - name: dags
          mountPath: /git-sync        
  initContainers:
  - name: git-sync
    image: "k8s.gcr.io/git-sync/git-sync:v3.6.1"
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - name: dags
        mountPath: /dags
    env:
      - name: GIT_SYNC_REPO
        value: "https://git-codecommit.<my_region>.amazonaws.com/v1/repos/<my_repo>"
      - name: GIT_SYNC_BRANCH
        value: "master"   
      - name: GIT_SYNC_ROOT
        value: /dags
      - name: GIT_SYNC_DEST
        value: "master"
      - name: GIT_SYNC_ONE_TIME
        value: "true"
      - name: GIT_SYNC_USERNAME
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: aws_access_key_id
      - name: GIT_SYNC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: aws_secret_access_key
  volumes:
    - name: dags
      emptyDir: {}

The error:

INFO: detected pid 1, running init handler
I0922 14:20:38.033790 11 main.go:389] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
I0922 14:20:38.044278 11 main.go:934] "level"=0 "msg"="cloning repo" "origin"="https://git-codecommit..amazonaws.com/v1/repos/" "path"="/dags"
E0922 14:20:38.136591 11 main.go:535] "msg"="too many failures, aborting" "error"="Run(git clone -v --no-checkout -b master https://git-codecommit..amazonaws.com/v1/repos/ /dags): exit status 128: { stdout: "", stderr: "Cloning into '/dags'...\nfatal: unable to access 'https://git-codecommit..amazonaws.com/v1/repos//': The requested URL returned error: 403" }" "failCount"=

The Permissions policies of the user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecretsManagerFullAccess",
            "Effect": "Allow",
            "Action": "secretsmanager:*",
            "Resource": "*"
        },
        {
            "Sid": "ECRAccess",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:DescribeImages",
                "ecr:GetRepositoryPolicy",
                "ecr:ListTagsForResource",
                "ecr:DescribeImageScanFindings"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CodeCommitFullAccess",
            "Effect": "Allow",
            "Action": "codecommit:*",
            "Resource": "*"
        }
    ]
}

thockin commented 9 months ago

A few things to do:

1) Can you manually prove that the username and password are correct (no trailing newline or anything) by doing git clone https://user:pass@server... ?

2) Run git-sync with -v 6 and see exactly which git commands it is running.

3) If that looks right, consider trying git-sync v4.0.0 and -v 9, which will log more useful info about flags and the md5sums of credentials.
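
For the first check, here is a minimal Python sketch of how one might spot stray whitespace in a credential and build the https://user:pass@server... URL for a manual git clone. It is not part of git-sync. The function names has_stray_whitespace and build_clone_url and the example.com URL are hypothetical, chosen only for illustration. It percent-encodes the username and password because characters such as "/", "+" or "=" would otherwise break the URL. Whether the server accepts the values as basic-auth credentials is exactly what the manual clone is meant to establish.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def has_stray_whitespace(value: str) -> bool:
    """Return True if the value has leading or trailing whitespace, e.g. a newline."""
    return value != value.strip()


def build_clone_url(repo_url: str, username: str, password: str) -> str:
    """Embed percent-encoded credentials into an https:// repository URL.

    Hypothetical helper for a one-off manual test; avoid keeping such URLs
    in shell history or logs.
    """
    parts = urlsplit(repo_url)
    if parts.scheme != "https":
        raise ValueError("expected an https:// URL")
    # safe="" makes quote() encode every reserved character, including "/".
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Placeholder values only; substitute the values actually stored in the secret.
    user, password = "user", "p/ss+word=\n"
    for label, value in (("username", user), ("password", password)):
        if has_stray_whitespace(value):
            print(f"{label} has leading or trailing whitespace")
    print(build_clone_url("https://example.com/v1/repos/demo", user, password.strip()))
```

The resulting URL can be passed to git clone by hand. If that clone also returns a 403, the problem lies with the credentials themselves rather than with how git-sync passes them.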