stderr said "stat /etc/git-secret/known_hosts: no such file or directory" after setting GIT_SYNC_KNOWN_HOSTS=false

eugeneYWang commented 1 year ago

Problem:

In a compose file where I am not concerned about man-in-the-middle attacks, I set - GIT_SYNC_KNOWN_HOSTS=false, but the container still logs errors like:

docker-compose-git-sync-service-1  | INFO: detected pid 1, running init handler
docker-compose-git-sync-service-1  | I0228 23:57:21.788120      12 main.go:389] "level"=0 "msg"="starting up" "pid"=12 "args"=["/git-sync"]
docker-compose-git-sync-service-1  | ERROR: can't configure SSH: can't access SSH known_hosts: stat /etc/git-secret/known_hosts: no such file or directory

Context:

I am using this docker image as a service in a docker compose file, alongside other Airflow services, in order to sync the DAG folder and other folders with a dedicated repo.

This is the git-sync service in my compose file:

  # have not tested for scenes when ./dags ./plugins are not created.
  # prerequisite: 
    # a ssh keypair file on the host path
  git-sync-service:
    image: k8s.gcr.io/git-sync/git-sync:v3.6.4
    profiles:
      - sync-dag
    environment:
      - GIT_SYNC_REPO=${GIT_SYNC_DAG_REPO}
      - GIT_SYNC_BRANCH=${GIT_SYNC_DAG_BRANCH}
      - GIT_SYNC_ADD_USER=${AIRFLOW_UID:-50000}
      # -1 below means it will retry forever. TODO: discuss this value for UAT setup
      # Note: decide that maybe one-time sync is better for local setup
      # - GIT_SYNC_MAX_FAILURES = -1
      - GIT_SYNC_ROOT=/git
      - GIT_SYNC_SSH=true
      - GIT_SYNC_KNOWN_HOSTS=false
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh
      - GIT_SYNC_ONE_TIME=true
      - GIT_SYNC_DEPTH=1
    volumes:
      - git-sync-root-volumn:/git
      - ./dags:/git/dags
      - ./plugins:/git/plugins
      - ${GIT_SSH_KEY_FILE_PATH}:/etc/git-secret/ssh

    # other default values: 10s for sync interval($GIT_SYNC_PERIOD), HEAD for checkout revision($GIT_SYNC_REV)

volumes:
  postgres-db-volume:
  mysql-db-volumn:
  git-sync-root-volumn:

eugeneYWang commented 1 year ago

BTW, if I understand the README of this repo correctly, mounting two volumes as two sub-folders under /git the way I did is fine, right?

regards.

eugeneYWang commented 1 year ago

Update:

I have a workaround: build a custom image that populates the known_hosts file. Here is my custom Dockerfile.

FROM k8s.gcr.io/git-sync/git-sync:v3.6.4
USER root
RUN mkdir -p /etc/git-secret/
RUN ssh-keyscan github.com > /etc/git-secret/known_hosts
USER 65533:65533

And this is the latest git-sync service:

  git-sync-service:
    # image: k8s.gcr.io/git-sync/git-sync:v3.6.4
    build:
      context: ./vendor-dockerfile/git-sync
      dockerfile: Dockerfile
    user: "${AIRFLOW_UID:-50000}:0"
    profiles:
      - sync-dag
    environment:
      - GIT_SYNC_REPO=${GIT_SYNC_DAG_REPO}
      - GIT_SYNC_BRANCH=${GIT_SYNC_DAG_BRANCH}
      # -1 below means it will retry forever. TODO: discuss this value for UAT setup
      # Note: decide that maybe one-time sync is better for local setup
      # - GIT_SYNC_MAX_FAILURES = -1
      - GIT_SYNC_ROOT=/git
      - GIT_SYNC_SSH=true
      # - GIT_SYNC_KNOWN_HOSTS=false
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh
      - GIT_SYNC_ONE_TIME=true
      - GIT_SYNC_DEPTH=1
    volumes:
      # - git-sync-root-volumn:/git
      - ./dags:/git/dags
      - ./plugins:/git/plugins
      - ${GIT_SSH_KEY_FILE_PATH}:/etc/git-secret/ssh

    # other default values: 10s for sync interval($GIT_SYNC_PERIOD), HEAD for checkout revision($GIT_SYNC_REV)

volumes:
  postgres-db-volume:
  mysql-db-volumn:
  # git-sync-root-volumn:

eugeneYWang commented 1 year ago

With the updated context, my new problem is:

How may I map my external dirs ./dags, ./plugins into two sub-dirs in the repo /git/dags, /git/plugins?

The problem returns:

docker-compose-git-sync-service-1  | INFO: detected pid 1, running init handler
docker-compose-git-sync-service-1  | I0301 00:48:41.924057      12 main.go:389] "level"=0 "msg"="starting up" "pid"=12 "args"=["/git-sync"]
docker-compose-git-sync-service-1  | I0301 00:48:41.938694      12 main.go:934] "level"=0 "msg"="cloning repo" "origin"="git@github.com:(my team)/(dag-repo).git" "path"="/git"
docker-compose-git-sync-service-1  | I0301 00:48:41.941094      12 main.go:940] "level"=0 "msg"="git root exists and is not empty (previous crash?), cleaning up" "path"="/git"
docker-compose-git-sync-service-1  | E0301 00:48:41.944425      12 main.go:535] "msg"="too many failures, aborting" "error"="unlinkat /git/dags: permission denied" "failCount"=1
docker-compose-git-sync-service-1 exited with code 1

eugeneYWang commented 1 year ago

I found a way to pull the git repo content down. However, the symlink created by git-sync makes the Airflow scheduler fall into a recursive loop: it sees the rev folder, and the symlink redirects back to the rev folder.

As a result, the output of git-sync cannot be used by Airflow directly in the context of docker compose.

Do you have any suggestion for my use case?

Attaching a working docker-compose service:

  # have not tested for scenes when ./dags ./plugins are not created.
  # prerequisite: 
    # a ssh keypair file on the host path
  git-sync-service:
    # image: k8s.gcr.io/git-sync/git-sync:v3.6.4
    build:
      context: ./vendor-dockerfile/git-sync
      dockerfile: Dockerfile
    user: "${AIRFLOW_UID:-50000}:0"
    profiles:
      - sync-dag
    environment:
      - GIT_SYNC_REPO=${GIT_SYNC_DAG_REPO}
      - GIT_SYNC_BRANCH=${GIT_SYNC_DAG_BRANCH}
      - GIT_SYNC_ADD_USER=true
      # -1 below means it will retry forever. TODO: discuss this value for UAT setup
      # Note: decide that maybe one-time sync is better for local setup
      # - GIT_SYNC_MAX_FAILURES = -1
      - GIT_SYNC_ROOT=/tmp
      - GIT_SYNC_SSH=true
      - GIT_SYNC_SSH_KEY_FILE=/etc/git-secret/ssh
      - GIT_SYNC_ONE_TIME=true
      - GIT_SYNC_DEPTH=0
    volumes:
      # - git-sync-root-volumn:/dags
      - ./dags:/tmp
      # - ./plugins:/git/plugins
      - ${GIT_SSH_KEY_FILE_PATH}:/etc/git-secret/ssh

eugeneYWang commented 1 year ago

Seems like https://github.com/kubernetes/git-sync/pull/285/files#diff-d2806be266240884b5b57b84f3ea9d75a7d502c494f094d7a7bb9d5161dd1e39

and https://github.com/kubernetes/git-sync/issues/314#issuecomment-740781574

The two links above should help me with the post-sync work!

Any more tips about that hook script are appreciated!

thockin commented 1 year ago

I set - GIT_SYNC_KNOWN_HOSTS=false

In v3 this is called GIT_KNOWN_HOSTS - it's a "bug" of sorts that the name is inconsistent. Flags are better than env vars, IMO.

WRT the rest, there are a lot of questions being asked at the same time.

git-sync needs to own whatever directory you give it as the root. The root itself might be a volume, but git-sync wants to own everything in it, which means you can't put other volumes underneath it.

git-sync always "publishes" a synced repo via a symlink. You can put that symlink anywhere that git-sync is able to write - for example, you can use "/tmp/git-sync" as the --root, and "/git/dags" as the --dest. /git/dags will be a symlink to /tmp/git-sync/<something you should not hardcode> which is the "current" state of the git repo. If you readlink the link and take the basename, it will give you the git SHA.

thockin commented 1 year ago

I'm going to close this - I don't think it's a "bug" - but I am happy to continue to discuss how to solve the usage pattern you want

eugeneYWang commented 1 year ago

@thockin Thank you for giving tips.

As for my usage pattern, I think I will have to hack my own way using the exechook_command option to copy the git data to my mounted host volume.

some irrelevant thought process for people stepping into this post in the future.

Just as in https://github.com/kubernetes/git-sync/issues/314, we need actual files; a symlink is not going to work in the context of docker compose. I tried to use a named docker volume as the root dir of git-sync, but I got blocked as in https://github.com/kubernetes/git-sync/issues/245. I found a Stack Overflow post that is supposed to solve it, but it still does not work for me.

thockin commented 1 year ago

The git-sync container is designed to do the right thing by default wrt volumes - you should only have issues if you change the user it runs as or the volume mount path, in which case there's not much it can do automatically.

yan-hic commented 1 year ago

@thockin please consider reopening: 3.6.5 ignores the setting when it is passed as a flag rather than as an env var:

INFO: detected pid 1, running init handler
I0413 14:28:01.318302      13 main.go:401] "level"=0 "msg"="starting up" "pid"=13 "args"=["/git-sync","--repo=git@github.com:myrepo","--ssh","true","--branch=master","--ssh-known-hosts=false"]
ERROR: can't configure SSH: can't access SSH known_hosts: stat /etc/git-secret/known_hosts: no such file or directory

EDIT - the behavior differs between the two: git-sync does skip the known_hosts check when the corresponding env var is passed

docker run -d     \
   -v $DIR:/tmp/git  \
   -v ssh:/etc/git-secret/ssh   \
   -u$(id -u):$(id -g)   \
   -e GIT_KNOWN_HOSTS=false  \
   registry.k8s.io/git-sync/git-sync:v3.6.5         \
   --repo=git@github.com:myrepo \
   --ssh true

but I get the reported error when using the flag

docker run -d     \
   -v $DIR:/tmp/git  \
   -v ssh:/etc/git-secret/ssh   \
   -u$(id -u):$(id -g)   \
   registry.k8s.io/git-sync/git-sync:v3.6.5         \
   --repo=git@github.com:myrepo \
   --ssh true   \
   --ssh-known-hosts false

thockin commented 1 year ago

I can't reproduce a failure:

X=/tmp/$RANDOM
rm -rf $X
mkdir -p $X
docker run -ti \
    -v $X:/tmp/git \
    -v ~/.ssh/id_ed25519:/etc/git-secret/ssh:ro \
    -u$(id -u):$(id -g) \
    registry.k8s.io/git-sync/git-sync:v3.6.5 \
        --repo=git@github.com:kubernetes/git-sync \
        --add-user \
        --ssh \
        --ssh-known-hosts=false \
        -v 2
INFO: detected pid 1, running init handler
I0413 19:02:36.314066      12 main.go:401] "level"=0 "msg"="starting up" "pid"=12 "args"=["/git-sync","--repo=git@github.com:kubernetes/git-sync","--add-user","--ssh","--ssh-known-hosts=false","-v","2"]
I0413 19:02:36.336143      12 main.go:1183] "level"=1 "msg"="setting up git SSH credentials"
I0413 19:02:36.336179      12 main.go:539] "level"=1 "msg"="syncing repo"
I0413 19:02:36.336200      12 main.go:950] "level"=0 "msg"="cloning repo" "origin"="git@github.com:kubernetes/git-sync" "path"="/tmp/git"
I0413 19:02:38.765109      12 main.go:760] "level"=0 "msg"="syncing git" "rev"="HEAD" "hash"="bb0128b883a7d8a48d04f9de714803afd24510e9"
I0413 19:02:38.770119      12 main.go:749] "level"=1 "msg"="removing worktree" "path"="/tmp/git/bb0128b883a7d8a48d04f9de714803afd24510e9"
I0413 19:02:38.776632      12 main.go:800] "level"=0 "msg"="adding worktree" "path"="/tmp/git/bb0128b883a7d8a48d04f9de714803afd24510e9" "branch"="origin/master"
I0413 19:02:38.853862      12 main.go:860] "level"=0 "msg"="reset worktree to hash" "path"="/tmp/git/bb0128b883a7d8a48d04f9de714803afd24510e9" "hash"="bb0128b883a7d8a48d04f9de714803afd24510e9"
I0413 19:02:38.853879      12 main.go:865] "level"=0 "msg"="updating submodules"
I0413 19:02:38.884503      12 main.go:717] "level"=1 "msg"="creating tmp symlink" "root"="/tmp/git/" "dst"="bb0128b883a7d8a48d04f9de714803afd24510e9" "src"="tmp-link"
I0413 19:02:38.885478      12 main.go:722] "level"=1 "msg"="renaming symlink" "root"="/tmp/git/" "old_name"="tmp-link" "new_name"="git-sync"
I0413 19:02:38.891484      12 main.go:608] "level"=1 "msg"="next sync" "wait_time"=1000000000

thockin commented 1 year ago

Hmm, changing it to --ssh-known-hosts false (space rather than =) fails. Debugging, that looks like a flag parsing failure.

Use = when setting values on flags, I guess? git-sync v3 uses Go's standard flag parsing library. It looks like even pflag has this oddity - boolean flags don't NORMALLY take an argument, so they only work with =.

yan-hic commented 1 year ago

Thanks @thockin! I use env vars in the end, so this is no showstopper. But for debugging permission issues (the logs are not explicit, even in verbose mode), I docker run with the entrypoint overridden to sh, and then I like to run the git-sync command with flags. Just for my info, is v4 different?

thockin commented 1 year ago

Unfortunately not. A bool flag is usually used as --boolflag, meaning ["--boolflag=true"], so the flag parser can't REALLY tell whether --boolflag false means ["--boolflag=false"] or ["--boolflag=true", "false"]. Both go flags and pflags chose the latter.

In general, the = syntax is always safe.
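
The ambiguity described above can be reproduced with Go's standard flag package. This is a standalone sketch, not git-sync's code: the flag set name "demo" and the helper parseKnownHosts are made up for illustration, and only the flag name is taken from the thread.

```go
package main

import (
	"flag"
	"fmt"
)

// parseKnownHosts parses args with a single boolean flag that defaults to
// true. It returns the flag's value and any leftover positional arguments.
func parseKnownHosts(args []string) (bool, []string, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	knownHosts := fs.Bool("ssh-known-hosts", true, "enable known_hosts verification")
	if err := fs.Parse(args); err != nil {
		return false, nil, err
	}
	return *knownHosts, fs.Args(), nil
}

func main() {
	cases := [][]string{
		{"--ssh-known-hosts=false"},    // value attached with =
		{"--ssh-known-hosts", "false"}, // value separated by a space
	}
	for _, args := range cases {
		val, rest, err := parseKnownHosts(args)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("args=%q known-hosts=%v leftover=%q\n", args, val, rest)
	}
}
```

With the = form the flag is false and nothing is left over. With the space form the flag is set to true and "false" is left behind as a positional argument, which is why the known_hosts check stayed enabled.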

kubernetes / git-sync

stderr said "stat /etc/git-secret/known_hosts: no such file or directory" after setting GIT_SYNC_KNOWN_HOSTS=false #694
