kubernetes / git-sync

A sidecar app which clones a git repo and keeps it in sync with the upstream.
Apache License 2.0

Issue with Git-sync repo inside a repo, Nested repo for airflow files #913

Closed Zhihui-Ellen-Jiang closed 1 month ago

Zhihui-Ellen-Jiang commented 1 month ago

These are my Airflow DAGs, synced with git-sync; they are in **/opt/airflow/dags/repo**. However, I have not made any changes to my destination. It was working fine this morning, but in the afternoon the DAGs ended up inside the repo directory. It is really weird.

-ltmxfwn Airflow-on-KinD % kubectl exec -it airflow-scheduler-5587b6795-vwhg5 -n airflow -- ls /opt/airflow/dags/repo
Defaulted container "scheduler" out of: scheduler, git-sync, scheduler-log-groomer, wait-for-airflow-migrations (init), git-sync-init (init)
100.py  fibo.py  haha.py

I also tried setting mountPath: /opt/airflow/dags/repo, but it got worse: now my Airflow DAGs are in /opt/airflow/dags/repo/repo. Somehow it is creating a nested repo directory for my Airflow DAG files, and I am unable to sync my files to Airflow. The following is the git-sync section of my values.yaml file for Airflow. I have not changed anything else in my config, and it is now set back to the version that was working before syncing stopped:

# Git sync
dags:
  # Where dags volume will be mounted. Works for both persistence and gitSync.
  # If not specified, dags mount path will be set to $AIRFLOW_HOME/dags
  mountPath: ~
  persistence:
    # Annotations for dags PVC
    annotations: {}
    # Enable persistent volume for storing dags
    enabled: false
    # Volume size for dags
    size: 1Gi
    # If using a custom storageClass, pass name here
    storageClassName:
    # access mode of the persistent volume
    accessMode: ReadWriteOnce
    ## the name of an existing PVC to use
    existingClaim:
    ## optional subpath for dag volume mount
    subPath: ~
  gitSync:
    enabled: true

    # git repo clone url
    # ssh example: git@github.com:apache/airflow.git
    # https example: https://github.com/apache/airflow.git
    repo: https://github.com/xxxx.git
    branch: main
    rev: HEAD
    # The git revision (branch, tag, or hash) to check out, v4 only
    # ref: main
    depth: 1
    # the number of consecutive failures allowed before aborting
    maxFailures: 0
    # subpath within the repo where dags are located
    # should be "" if dags are at repo root
    subPath: ""
    # if your repo needs a user name password
    # you can load them to a k8s secret like the one below
    #   ---
    #   apiVersion: v1
    #   kind: Secret
    #   metadata:
    #     name: git-credentials
    #   data:
    #     # For git-sync v3
    #     GIT_SYNC_USERNAME: <base64_encoded_git_username>
    #     GIT_SYNC_PASSWORD: <base64_encoded_git_password>
    #     # For git-sync v4
    #     GITSYNC_USERNAME: <base64_encoded_git_username>
    #     GITSYNC_PASSWORD: <base64_encoded_git_password>
    # and specify the name of the secret below
    #
    # credentialsSecret: git-credentials
    #
    #
    # If you are using an ssh clone url, you can load
    # the ssh private key to a k8s secret like the one below
    #   ---
    #   apiVersion: v1
    #   kind: Secret
    #   metadata:
    #     name: airflow-ssh-secret
    #   data:
    #     # key needs to be gitSshKey
    #     gitSshKey: <base64_encoded_data>
    # and specify the name of the secret below
    sshKeySecret: airflow-ssh-git-secret
    #
    # If you are using an ssh private key, you can additionally
    # specify the content of your known_hosts file, example:
    #
    # knownHosts: |
    #    <host1>,<ip1> <key1>
    #    <host2>,<ip2> <key2>

    # interval between git sync attempts in seconds
    # high values are more likely to cause DAGs to become out of sync between different components
    # low values cause more traffic to the remote git repository
    # Go-style duration string (e.g. "100ms" or "0.1s" = 100ms).
    # For backwards compatibility, wait will be used if it is specified.
    period: 10s
    wait: ~

    containerName: git-sync
    uid: 65533

    # When not set, the values defined in the global securityContext will be used
    securityContext: {}
    #  runAsUser: 65533
    #  runAsGroup: 0

    securityContexts:
      container: {}

    # container level lifecycle hooks
    containerLifecycleHooks: {}

    # Mount additional volumes into git-sync. It can be templated like in the following example:
    #   extraVolumeMounts:
    #     - name: my-templated-extra-volume
    #       mountPath: "{{ .Values.my_custom_path }}"
    #       readOnly: true
    extraVolumeMounts: []
    env: []
    # Supported env vars for gitsync can be found at https://github.com/kubernetes/git-sync
    # - name: ""
    #   value: ""

    # Configuration for empty dir volume
    # emptyDirConfig:
    #   sizeLimit: 1Gi
    #   medium: Memory

    resources: 
      limits:
        cpu: 1000m  # Increase to 1 CPU
        memory: 1Gi  # Increase to 1 GiB
      requests:
        cpu: 500m  # Increase to 0.5 CPU
        memory: 512Mi  # Increase to 512 MiB
thockin commented 1 month ago

I'm sorry, can you help me understand this a little more? Are you saying that it was all working fine, then you went away from it, and when you came back it wasn't working anymore?

Did you look at the logs for gitsync?

If you run 'ls -l' in the gitsync root, it will show you the hash that it has checked out.

Can you also say which version of git-sync you're using? If you can post a full set of logs, preferably with -v 6, it would really help.

Zhihui-Ellen-Jiang commented 1 month ago

Hello, it was syncing fine and then it stopped. I guess I can't say it "stopped"; rather, it is syncing to the wrong destination. I noticed because I was in the middle of making changes to the Airflow DAGs in my GitHub repo. I had synced successfully a couple of times in one sitting, and then the changes stopped showing up correctly in the Airflow UI. I checked, and the files were synced into the "repo" directory at /opt/airflow/dags/repo, but they are supposed to be synced to /opt/airflow/dags.

jiang@zuij-ltmxfwn Airflow-on-KinD % kubectl logs airflow-scheduler-c4897b579-2hrb4 -n airflow -c git-sync
INFO: detected pid 1, running init handler
{"logger":"","ts":"2024-07-26 02:39:07.121457","caller":{"file":"main.go","line":361},"level":0,"msg":"setting --ref from deprecated --branch"}
{"logger":"","ts":"2024-07-26 02:39:07.121631","caller":{"file":"main.go","line":393},"level":0,"msg":"setting --link from deprecated --dest"}
{"logger":"","ts":"2024-07-26 02:39:07.121703","caller":{"file":"main.go","line":523},"level":0,"msg":"starting up","pid":12,"uid":65533,"gid":65533,"home":"/tmp","flags":["--add-user=true","--branch=main","--change-permissions=0","--cookie-file=false","--credential=[]","--depth=1","--dest=repo","--exechook-backoff=3s","--exechook-timeout=30s","--git=git","--git-gc=always","--group-write=false","--help=false","--http-metrics=false","--http-pprof=false","--link=repo","--man=false","--max-failures=0","--max-sync-failures=0","--one-time=false","--period=10s","--ref=main","--repo=https://github.com/xxx/airflow-dags.git","--rev=HEAD","--root=/git","--ssh=false","--ssh-key-file=[/etc/git-secret/ssh]","--ssh-known-hosts=false","--ssh-known-hosts-file=/etc/git-secret/known_hosts","--stale-worktree-timeout=0s","--submodules=recursive","--sync-timeout=2m0s","--timeout=0","--v=-1","--verbose=0","--version=false","--wait=0","--webhook-backoff=3s","--webhook-method=POST","--webhook-success-status=200","--webhook-timeout=1s"]}
{"logger":"","ts":"2024-07-26 02:39:07.358613","caller":{"file":"main.go","line":1639},"level":0,"msg":"update required","ref":"main","local":"5d24c4f15062910380fce812c0aeb4567e71e519","remote":"5d24c4f15062910380fce812c0aeb4567e71e519","syncCount":0}
{"logger":"","ts":"2024-07-26 02:39:07.684549","caller":{"file":"main.go","line":1690},"level":0,"msg":"updated successfully","ref":"main","remote":"5d24c4f15062910380fce812c0aeb4567e71e519","syncCount":1}
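The "starting up" entry in the log above is a JSON object whose `flags` array records every flag git-sync was started with. As a minimal sketch (the helper name `flags_of_interest` is hypothetical, and the log line is abbreviated from the one above), the few flags that decide where files land can be pulled out like this:

```python
import json

# Abbreviated copy of the "starting up" log line from this thread; only a few
# of the original flags are kept so the example stays short.
line = (
    '{"logger":"","ts":"2024-07-26 02:39:07.121703",'
    '"caller":{"file":"main.go","line":523},"level":0,"msg":"starting up",'
    '"flags":["--branch=main","--depth=1","--dest=repo","--link=repo",'
    '"--ref=main","--root=/git","--v=-1"]}'
)


def flags_of_interest(log_line, names=("root", "link", "ref", "depth")):
    """Return {flag_name: value} for the selected flags in a startup log line."""
    record = json.loads(log_line)
    result = {}
    for flag in record.get("flags", []):
        # "--root=/git" -> name "root", value "/git"
        name, _, value = flag.lstrip("-").partition("=")
        if name in names:
            result[name] = value
    return result


print(flags_of_interest(line))
```

With the line above, this reports `root=/git`, `link=repo`, `ref=main` and `depth=1`, which are the values that determine the on-disk layout.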
Zhihui-Ellen-Jiang commented 1 month ago
`ihui.jiang@zhihuij-ltmxfwn Airflow-on-KinD % kubectl describe pod airflow-scheduler-c4897b579-2hrb4 -n airflow | grep -A 5 "git-sync"

  git-sync-init:
    Container ID:   containerd://2f4a8e4bac3fe7371c76dcc2ff7f99b63444a5f31d97f1d36298ea36b385fae7
    Image:          registry.k8s.io/git-sync/git-sync:v4.1.0
    Image ID:       registry.k8s.io/git-sync/git-sync@sha256:fd9722fd02e3a559fd6bb4427417c53892068f588fc8372aa553fbf2f05f9902
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
--
      /etc/git-secret/ssh from git-sync-ssh-key (ro,path="gitSshKey")
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ngkzw (ro)
Containers:
  scheduler:
    Container ID:  containerd://1ae2434d875776a45c0ef04defa16f33c8cc7d5313da8193b0b6c398cd00014e
--
  git-sync:
    Container ID:   containerd://52db865fe762253a99ec89ad28f8937a5433dcde5a690115d6b325c5fee2bb09
    Image:          registry.k8s.io/git-sync/git-sync:v4.1.0
    Image ID:       registry.k8s.io/git-sync/git-sync@sha256:fd9722fd02e3a559fd6bb4427417c53892068f588fc8372aa553fbf2f05f9902
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 25 Jul 2024 19:39:07 -0700
    Ready:          True
--
      /etc/git-secret/ssh from git-sync-ssh-key (ro,path="gitSshKey")
      /git from dags (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ngkzw (ro)
  scheduler-log-groomer:
    Container ID:  containerd://6d91b3ac1e0ab559da361d5bfda506f5b5ee8bc992d0761d9ea5cee8452aad44
    Image:         apache/airflow:2.9.2
--
  git-sync-ssh-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  airflow-ssh-git-secret
    Optional:    false
  logs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
--
  Normal   Pulled     8m43s  kubelet            Container image "registry.k8s.io/git-sync/git-sync:v4.1.0" already present on machine
  Normal   Created    8m43s  kubelet            Created container git-sync-init
  Normal   Started    8m43s  kubelet            Started container git-sync-init
  Normal   Pulled     8m42s  kubelet            Container image "apache/airflow:2.9.2" already present on machine
  Normal   Created    8m42s  kubelet            Created container scheduler
  Normal   Started    8m41s  kubelet            Started container scheduler
  Normal   Pulled     8m41s  kubelet            Container image "registry.k8s.io/git-sync/git-sync:v4.1.0" already present on machine
  Normal   Created    8m41s  kubelet            Created container git-sync
  Normal   Started    8m41s  kubelet            Started container git-sync
  Normal   Pulled     8m41s  kubelet            Container image "apache/airflow:2.9.2" already present on machine
  Normal   Created    8m41s  kubelet            Created container scheduler-log-groomer
  Normal   Started    8m41s  kubelet            Started container scheduler-log-groomer
  Warning  Unhealthy  8m32s  kubelet            Startup probe failed: /home/airflow/.local/lib/python3.12/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
No alive jobs found.
`
thockin commented 1 month ago

That log shows these as the important flags:

git-sync "--depth=1" "--link=repo" "--ref=main" "--root=/git"

That is going to sync the repo into /git, and publish the worktree at /git/repo (which will be a symlink to a local directory named after the git SHA).
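The layout described here can be reproduced in a scratch directory to see what `ls -l` would show. This is only a sketch under assumptions: the `.worktrees/<sha>` directory name is taken from the paths reported later in this thread, the relative symlink target is a guess rather than something confirmed by git-sync's source, and `build_layout` and `checked_out_sha` are hypothetical helper names.

```python
import os
import tempfile

sha = "5d24c4f15062910380fce812c0aeb4567e71e519"  # hash taken from the logs in this thread


def build_layout(root, link_name, sha):
    """Create <root>/.worktrees/<sha> and a symlink <root>/<link_name> pointing at it."""
    worktree = os.path.join(root, ".worktrees", sha)
    os.makedirs(worktree)
    # Assumption: a relative target, so the link resolves wherever the volume is mounted.
    os.symlink(os.path.join(".worktrees", sha), os.path.join(root, link_name))
    return worktree


def checked_out_sha(root, link_name):
    """Read the symlink target and return its final path component (the SHA)."""
    return os.path.basename(os.readlink(os.path.join(root, link_name)))


with tempfile.TemporaryDirectory() as root:  # stands in for --root=/git
    build_layout(root, "repo", sha)          # "repo" stands in for --link=repo
    print(checked_out_sha(root, "repo"))
```

Because the published name is a symlink, any consumer that mounts the whole root sees both the link and the real worktree directory underneath it.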

I don't know what /opt/airflow/dags is or where it comes from. You should run ls -l on all of those intermediate directories - I think there's symlink shenanigans going on.

Also, 4.1.0 is pretty old; a lot of bugs have been fixed since then. The current release is 4.2.4.

Zhihui-Ellen-Jiang commented 1 month ago

Thank you for the suggestions; I have followed the steps. The DAGs are still not showing up in my Airflow UI. I have upgraded to 4.2.4, and now Airflow raises a runtime error when walking the DAG directory: RuntimeError: Detected recursive loop when walking DAG directory /opt/airflow/dags: /opt/airflow/dags/repo/.worktrees/5d24c4f15062910380fce812c0aeb4567e71e519 has appeared more than once.

Logs are here:

`[2024-07-26T14:51:43.459+0000] {manager.py:272} WARNING - DagFileProcessorManager (PID=128) exited with exit code 1 - re-launching
[2024-07-26T14:51:43.462+0000] {manager.py:170} INFO - Launched DagFileProcessorManager with pid: 129
[2024-07-26T14:51:43.468+0000] {settings.py:60} INFO - Configured default timezone UTC
[2024-07-26T14:51:43.498+0000] {settings.py:518} INFO - Loaded airflow_local_settings from /opt/airflow/config/airflow_local_settings.py .
Process ForkProcess-73:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/dag_processing/manager.py", line 241, in _run_processor_manager
    processor_manager.start()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/dag_processing/manager.py", line 476, in start
    return self._run_parsing_loop()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/dag_processing/manager.py", line 549, in _run_parsing_loop
    self._refresh_dag_dir()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/dag_processing/manager.py", line 738, in _refresh_dag_dir
    self._file_paths = list_py_file_paths(self._dag_directory)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/file.py", line 298, in list_py_file_paths
    file_paths.extend(find_dag_file_paths(directory, safe_mode))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/file.py", line 311, in find_dag_file_paths
    for file_path in find_path_from_directory(directory, ".airflowignore"):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/file.py", line 241, in _find_path_from_directory
    raise RuntimeError(
RuntimeError: Detected recursive loop when walking DAG directory /opt/airflow/dags: /opt/airflow/dags/repo/.worktrees/5d24c4f15062910380fce812c0aeb4567e71e519 has appeared more than once.
`

And my DAG files are in /opt/airflow/dags/repo/.worktrees/5d24c4f15062910380fce812c0aeb4567e71e519, as shown below:

`hui.jiang@zhihuij-ltmxfwn Airflow-on-KinD % kubectl exec -it airflow-scheduler-654996d476-w9b6q -n airflow -- ls -l /opt/airflow/dags/repo/.worktrees/5d24c4f15062910380fce812c0aeb4567e71e519

Defaulted container "scheduler" out of: scheduler, git-sync, scheduler-log-groomer, wait-for-airflow-migrations (init), git-sync-init (init)
total 8
-rw-r--r-- 1 65533 root  973 Jul 26 14:50 100.py
-rw-r--r-- 1 65533 root 1566 Jul 26 14:50 fibo.py`
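
The "has appeared more than once" error is what one would expect when a directory walk follows symlinks and reaches the same real directory by two routes. The following is a simplified sketch of that idea, not Airflow's actual implementation; `walk_detecting_duplicates` and `make_volume` are hypothetical names, and the volume layout (a `.worktrees/<sha>` directory plus a `repo` symlink to it) is an assumption based on the paths in this thread.

```python
import os
import tempfile
from pathlib import Path


def walk_detecting_duplicates(base):
    """Walk `base` following symlinks; raise if a real directory is reached twice."""
    seen = set()
    for root, dirs, _files in os.walk(base, followlinks=True):
        for name in dirs:
            real = (Path(root) / name).resolve()
            if real in seen:
                raise RuntimeError(f"{real} has appeared more than once")
            seen.add(real)
    return seen


def make_volume(root, sha="abc123"):
    """Hypothetical git-sync-style volume: .worktrees/<sha> plus a 'repo' symlink to it."""
    (Path(root) / ".worktrees" / sha).mkdir(parents=True)
    os.symlink(os.path.join(".worktrees", sha), os.path.join(root, "repo"))


with tempfile.TemporaryDirectory() as volume:
    make_volume(volume)
    try:
        walk_detecting_duplicates(volume)
    except RuntimeError as err:
        print("loop detected:", err)
```

In this sketch the worktree is reached once through the `repo` link and once through `.worktrees`, so the walk raises even though there is no true cycle.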
thockin commented 1 month ago

Can you run ls -ld on each of /opt, /opt/airflow, /opt/airflow/dags, and /opt/airflow/dags/repo? Maybe also on /git.

Also, if you can, run git-sync with -v 6 - the short logs you posted don't line up with what you are saying.

Zhihui-Ellen-Jiang commented 1 month ago

Hello, thank you for helping me. I have solved this issue. I changed the setting to dags_folder = /opt/airflow/dags/repo and it mounted correctly.
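
A plausible reason this works is that starting the scan at the published symlink, rather than at its parent, means each file is reached by only one route. The sketch below illustrates that under assumptions: the layout is a stand-in built in a temporary directory, `list_py_files` is a hypothetical helper rather than Airflow's file scanner, and `abc123` is a placeholder for a real commit hash.

```python
import os
import tempfile
from pathlib import Path


def list_py_files(dags_folder):
    """Return resolved paths of all .py files under dags_folder, following symlinks."""
    found = []
    for root, _dirs, files in os.walk(dags_folder, followlinks=True):
        found.extend(str((Path(root) / f).resolve()) for f in files if f.endswith(".py"))
    return sorted(found)


with tempfile.TemporaryDirectory() as volume:
    # Stand-in for the shared volume: one worktree plus a "repo" link to it.
    worktree = Path(volume) / ".worktrees" / "abc123"
    worktree.mkdir(parents=True)
    (worktree / "fibo.py").write_text("# placeholder DAG file\n")
    os.symlink(os.path.join(".worktrees", "abc123"), os.path.join(volume, "repo"))

    from_root = list_py_files(volume)                        # scan the volume root
    from_link = list_py_files(os.path.join(volume, "repo"))  # scan the published link
    print(len(from_root), len(from_link))
```

Scanning from the root reaches `fibo.py` twice (via `repo` and via `.worktrees`), while scanning from the link reaches it once.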