iterative / dvc

DVC get/import failed when Git behind a proxy #10563

Closed luhuiguo closed 1 month ago

luhuiguo commented 1 month ago

Bug Report

get/import : Name or service not known

Description

I have a situation where my computer is behind a proxy, and needs to access a Git repository outside of the proxy network. When running dvc get/import behind my proxy, my file is not downloaded and I get the following error: [Errno -2] Name or service not known.

Configure Git to use a proxy

$ git config --global http.proxy http://proxy.address.com:port/
$ git config --global https.proxy http://proxy.address.com:port/

git clone, dvc pull, etc.: everything works.

But when I try to download a file tracked by DVC into another workspace:

$ dvc get GIT_URL PATH
Cloning REPOSITORY.git|███████████████████████████████████████████████████████████████████████████████████████| 
Compressing |234/234 [00:20,    599obj/s]
ERROR: unexpected error - HTTPSConnectionPool(host='GITLAB.ADDRESS.COM', port=443): Max retries exceeded with url: /REPOSITORY.git/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd9e93e8190>: Failed to resolve 'GITLAB.ADDRESS.COM' ([Errno -2] Name or service not known)")): HTTPSConnectionPool(host='GITLAB.ADDRESS.COM', port=443): Max retries exceeded with url: REPOSITORY.git/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd9e93e8190>: Failed to resolve 'GITLAB.ADDRESS.COM' ([Errno -2] Name or service not known)")): <urllib3.connection.HTTPSConnection object at 0x7fd9e93e8190>: Failed to resolve 'GITLAB.ADDRESS.COM' ([Errno -2] Name or service not known): [Errno -2] Name or service not known

Reproduce

dvc get GIT_URL_BEHIND_A_PROXY PATH

Expected

dvc get/import should use the Git proxy config.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.1 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
        gdrive (pydrive2 = 1.20.0),
        gs (gcsfs = 2024.6.1),
        hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.6.1)
Config:
        Global: /home/luhg/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: xfs on /dev/mapper/nvme-data2
Caches: local
Remotes: ssh, local
Workspace directory: xfs on /dev/mapper/nvme-data2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/c86ae03c82a8ffc47d6317a3b6788f2d

shcheklein commented 1 month ago

@luhuiguo could you please run it with -v and share the full stack trace? Thanks.

Is it the same error if you run it without setting up the proxy in Git?

luhuiguo commented 1 month ago

Unset the Git proxy

$ git config unset --global http.proxy
$ git config unset --global https.proxy
$ git config list
filter.lfs.clean=git-lfs clean -- %f
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true

Git clone failed

$ git clone <GIT_REPOSITORY_URL>
Cloning into '<DIRECTORY>'...
fatal: unable to access '<GIT_REPOSITORY_URL>': Could not resolve host: <GITLAB_HOST>

DVC get failed

$ dvc get -v <GIT_REPOSITORY_URL> <PATH>
2024-09-23 10:45:53,687 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
2024-09-23 10:45:53,687 DEBUG: command: get -v <GIT_REPOSITORY_URL> <PATH>
2024-09-23 10:45:53,946 DEBUG: Creating external repo <GIT_REPOSITORY_URL>@None
2024-09-23 10:45:53,946 DEBUG: erepo: git clone '<GIT_REPOSITORY_URL>' to a temporary dir
2024-09-23 10:46:18,913 ERROR: failed to get '<PATH>' - SCM error: Failed to clone repo '<GIT_REPOSITORY_URL>' to '/tmp/tmpibarss8odvc-clone': HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)")): HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)")): <urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known): [Errno -2] Name or service not known
Traceback (most recent call last):
  File "urllib3/connection.py", line 196, in _new_conn
  File "urllib3/util/connection.py", line 60, in create_connection
  File "socket.py", line 955, in getaddrinfo
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "urllib3/connectionpool.py", line 789, in urlopen
  File "urllib3/connectionpool.py", line 490, in _make_request
  File "urllib3/connectionpool.py", line 466, in _make_request
  File "urllib3/connectionpool.py", line 1095, in _validate_conn
  File "urllib3/connection.py", line 615, in connect
  File "urllib3/connection.py", line 203, in _new_conn
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dulwich/client.py", line 2290, in _http_request
  File "urllib3/_request_methods.py", line 136, in request
  File "urllib3/_request_methods.py", line 183, in request_encode_url
  File "urllib3/poolmanager.py", line 443, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 843, in urlopen
  File "urllib3/util/retry.py", line 519, in increment
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scmrepo/git/backend/dulwich/__init__.py", line 260, in clone
  File "dulwich/porcelain.py", line 546, in clone
  File "dulwich/client.py", line 752, in clone
  File "dulwich/client.py", line 840, in fetch
  File "dulwich/client.py", line 2157, in fetch_pack
  File "dulwich/client.py", line 2013, in _discover_references
  File "scmrepo/git/backend/dulwich/client.py", line 50, in _http_request
  File "dulwich/client.py", line 2298, in _http_request
dulwich.errors.GitProtocolError: HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f0967118070>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/scm.py", line 150, in clone
  File "scmrepo/git/__init__.py", line 154, in clone
  File "scmrepo/git/backend/dulwich/__init__.py", line 268, in clone
scmrepo.exceptions.CloneError: Failed to clone repo '<GIT_REPOSITORY_URL>' to '/tmp/tmpibarss8odvc-clone'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/commands/get.py", line 37, in _get_file_from_repo
  File "dvc/repo/get.py", line 45, in get
  File "dvc/repo/__init__.py", line 302, in open
  File "dvc/repo/open_repo.py", line 60, in open_repo
  File "contextlib.py", line 79, in inner
  File "dvc/repo/open_repo.py", line 23, in _external_repo
  File "dvc/repo/open_repo.py", line 134, in _cached_clone
  File "funcy/decorators.py", line 47, in wrapper
  File "funcy/flow.py", line 246, in wrap_with
  File "funcy/decorators.py", line 68, in __call__
  File "dvc/repo/open_repo.py", line 198, in _clone_default_branch
  File "dvc/scm.py", line 155, in clone
dvc.scm.CloneError: SCM error

2024-09-23 10:46:18,950 DEBUG: Analytics is enabled.
2024-09-23 10:46:18,952 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpggwvzttp', '-v']
2024-09-23 10:46:18,962 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpggwvzttp', '-v'] with pid 201

Configure Git to use a proxy

$ git config --global http.proxy http://10.3.12.8:3128
$ git config --global https.proxy http://10.3.12.8:3128
$ git config list
filter.lfs.clean=git-lfs clean -- %f
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
http.proxy=http://10.3.12.8:3128
https.proxy=http://10.3.12.8:3128

Git clone succeeds

$ git clone <GIT_REPOSITORY_URL>
Cloning into '<DIRECTORY>'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 166 (delta 42), reused 0 (delta 0), pack-reused 33
Receiving objects: 100% (166/166), 11.08 MiB | 874.00 KiB/s, done.
Resolving deltas: 100% (48/48), done.

DVC get failed

$ dvc get -v <GIT_REPOSITORY_URL> <PATH>
2024-09-23 10:40:27,336 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
2024-09-23 10:40:27,336 DEBUG: command: get -v <GIT_REPOSITORY_URL> <PATH>
2024-09-23 10:40:27,486 DEBUG: Creating external repo <GIT_REPOSITORY_URL>@None
2024-09-23 10:40:27,486 DEBUG: erepo: git clone '<GIT_REPOSITORY_URL>' to a temporary dir
2024-09-23 10:41:06,026 ERROR: unexpected error - HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)")): HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)")): <urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known): [Errno -2] Name or service not known
Traceback (most recent call last):
  File "urllib3/connection.py", line 196, in _new_conn
  File "urllib3/util/connection.py", line 60, in create_connection
  File "socket.py", line 955, in getaddrinfo
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "urllib3/connectionpool.py", line 789, in urlopen
  File "urllib3/connectionpool.py", line 490, in _make_request
  File "urllib3/connectionpool.py", line 466, in _make_request
  File "urllib3/connectionpool.py", line 1095, in _validate_conn
  File "urllib3/connection.py", line 615, in connect
  File "urllib3/connection.py", line 203, in _new_conn
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dulwich/client.py", line 2290, in _http_request
  File "urllib3/_request_methods.py", line 136, in request
  File "urllib3/_request_methods.py", line 183, in request_encode_url
  File "urllib3/poolmanager.py", line 443, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 843, in urlopen
  File "urllib3/util/retry.py", line 519, in increment
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)"))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/cli/__init__.py", line 211, in main
  File "dvc/cli/command.py", line 41, in do_run
  File "dvc/commands/get.py", line 30, in run
  File "dvc/commands/get.py", line 37, in _get_file_from_repo
  File "dvc/repo/get.py", line 45, in get
  File "dvc/repo/__init__.py", line 302, in open
  File "dvc/repo/open_repo.py", line 60, in open_repo
  File "contextlib.py", line 79, in inner
  File "dvc/repo/open_repo.py", line 23, in _external_repo
  File "dvc/repo/open_repo.py", line 134, in _cached_clone
  File "funcy/decorators.py", line 47, in wrapper
  File "funcy/flow.py", line 246, in wrap_with
  File "funcy/decorators.py", line 68, in __call__
  File "dvc/repo/open_repo.py", line 198, in _clone_default_branch
  File "dvc/scm.py", line 152, in clone
  File "dvc/repo/experiments/utils.py", line 275, in fetch_all_exps
  File "dvc/repo/experiments/utils.py", line 275, in <listcomp>
  File "dvc/repo/experiments/utils.py", line 119, in iter_remote_refs
  File "scmrepo/git/backend/dulwich/__init__.py", line 590, in iter_remote_refs
  File "dulwich/client.py", line 2208, in get_refs
  File "dulwich/client.py", line 2013, in _discover_references
  File "scmrepo/git/backend/dulwich/client.py", line 50, in _http_request
  File "dulwich/client.py", line 2298, in _http_request
dulwich.errors.GitProtocolError: HTTPSConnectionPool(host='<GITLAB_HOST>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7ff140e4c040>: Failed to resolve '<GITLAB_HOST>' ([Errno -2] Name or service not known)"))

2024-09-23 10:41:06,128 DEBUG: Version info for developers:
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
        gdrive (pydrive2 = 1.20.0),
        gs (gcsfs = 2024.6.1),
        hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.6.1)
Config:
        Global: /home/luhg/.config/dvc
        System: /etc/xdg/dvc

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-09-23 10:41:06,145 DEBUG: Analytics is enabled.
2024-09-23 10:41:06,147 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp8y57xigq', '-v']
2024-09-23 10:41:06,155 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp8y57xigq', '-v'] with pid 174

shcheklein commented 1 month ago

Okay, this should probably be fixed on the dulwich side (https://github.com/jelmer/dulwich).

Is there a way for you to run with HTTP_PROXY and HTTPS_PROXY env vars set? I think dulwich supports those.

Probably you can create an alias for now for dvc get to include those env vars. Would that work for you for now?
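The alias idea can also be sketched as a small Python wrapper. This is only an illustration under assumptions: the function names (`git_http_proxy`, `env_with_proxy`, `run_with_git_proxy`) are hypothetical, it reads only Git's `http.proxy` setting, and whether dulwich actually honours `HTTP_PROXY`/`HTTPS_PROXY` in a given setup is exactly what this thread is trying to establish.

```python
import os
import subprocess


def git_http_proxy():
    """Return Git's configured http.proxy, or None if unset or Git is missing."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", "http.proxy"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    return result.stdout.strip() or None


def env_with_proxy(proxy, base_env=None):
    """Copy an environment and add proxy variables that are not already set."""
    env = dict(os.environ if base_env is None else base_env)
    if proxy:
        for name in ("HTTP_PROXY", "HTTPS_PROXY"):
            env.setdefault(name, proxy)  # never override an explicit setting
    return env


def run_with_git_proxy(command):
    """Run a command, e.g. ["dvc", "get", url, path], with Git's proxy exported."""
    completed = subprocess.run(command, env=env_with_proxy(git_http_proxy()))
    return completed.returncode
```

Calling `run_with_git_proxy(["dvc", "get", "<GIT_REPOSITORY_URL>", "<PATH>"])` would then behave like the suggested alias. Existing proxy variables are left untouched, so an explicit environment setting still wins over the Git config.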

shcheklein commented 1 month ago

I've created an issue upstream https://github.com/jelmer/dulwich/issues/1368

shcheklein commented 1 month ago

Okay, it seems this (the proxy via global Git config) should be supported. I tried this:

(.venv) √ Projects/test-dvc-get % dvc import https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
ERROR: failed to import 'get-started/data.xml' from 'https://github.com/iterative/dataset-registry'. - stage working dir '/Users/ivan/Projects/test-dvc-get/data' does not exist
(.venv) ?1 Projects/test-dvc-get % mkdir data
(.venv) √ Projects/test-dvc-get % dvc import https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
ERROR: failed to import 'get-started/data.xml' - SCM error: Failed to clone repo 'https://github.com/iterative/dataset-registry' to '/var/folders/8f/fbysfztx1mb953_gpwl477p80000gn/T/tmphwwt1qxrdvc-clone': HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x105ce6330>: Failed to establish a new connection: [Errno 51] Network is unreachable'))): HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by ProxyError('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x105ce6330>: Failed to establish a new connection: [Errno 51] Network is unreachable'))): ('Unable to connect to proxy', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x105ce6330>: Failed to establish a new connection: [Errno 51] Network is unreachable')): <urllib3.connection.HTTPSConnection object at 0x105ce6330>: Failed to establish a new connection: [Errno 51] Network is unreachable: [Errno 51] Network is unreachable

after running:

$ git config --global http.proxy http://10.3.12.8:3128
$ git config --global https.proxy http://10.3.12.8:3128

So it is trying to connect to the proxy (and fails).

We need a simpler way to reproduce this in order to investigate, e.g. some way to run a local proxy for experiments.
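For local experiments, a throwaway proxy can be sketched with the Python standard library alone. The sketch below is an assumption-laden illustration, not a replacement for a real proxy: `ConnectProxyHandler` and `start_proxy` are hypothetical names, only the HTTP `CONNECT` method (what HTTPS clients use through an HTTP proxy) is handled, and there is no authentication or access control. It records every requested target, which makes it easy to see whether a client went through the proxy at all.

```python
import select
import socket
import socketserver
import threading


class ConnectProxyHandler(socketserver.StreamRequestHandler):
    """Serve one HTTP CONNECT request, then relay bytes in both directions."""

    rbufsize = 0  # unbuffered reads, so no tunnelled bytes get stuck in rfile

    def handle(self):
        request_line = self.rfile.readline().decode("latin-1").strip()
        # Skip the remaining request headers, up to the blank line.
        while self.rfile.readline() not in (b"\r\n", b"\n", b""):
            pass
        parts = request_line.split()
        if len(parts) != 3 or parts[0].upper() != "CONNECT":
            self.wfile.write(b"HTTP/1.1 405 Method Not Allowed\r\n\r\n")
            return
        target = parts[1]
        # Record the requested target so a test can tell the proxy was used.
        self.server.seen_targets.append(target)
        host, _, port = target.rpartition(":")
        try:
            # Name resolution for the target happens here, on the proxy side.
            upstream = socket.create_connection((host, int(port)), timeout=10)
        except (OSError, ValueError):
            self.wfile.write(b"HTTP/1.1 502 Bad Gateway\r\n\r\n")
            return
        with upstream:
            self.wfile.write(b"HTTP/1.1 200 Connection established\r\n\r\n")
            self._relay(self.connection, upstream)

    @staticmethod
    def _relay(client, upstream):
        while True:
            readable, _, _ = select.select([client, upstream], [], [], 60)
            if not readable:
                return  # idle for too long
            for sock in readable:
                data = sock.recv(65536)
                if not data:
                    return  # one side closed the connection
                (upstream if sock is client else client).sendall(data)


def start_proxy(host="127.0.0.1", port=0):
    """Start the proxy in a background thread and return the server object."""
    server = socketserver.ThreadingTCPServer((host, port), ConnectProxyHandler)
    server.daemon_threads = True
    server.seen_targets = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After `proxy = start_proxy(port=3128)`, a client under test can be pointed at `http://127.0.0.1:3128`. If `proxy.seen_targets` is still empty after a failed clone, that would suggest the client never contacted the proxy and tried to resolve the hostname itself.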

luhuiguo commented 1 month ago

Is it a problem related to domain name resolution?

my error message:

[Errno -2] Name or service not known

and your message:

[Errno 51] Network is unreachable

We run a self-managed GitLab instance on the company intranet and rely on the intranet's domain name resolution.

The GitLab hostname is not resolvable outside our company intranet.

On my PC:

ping <GITLAB_HOSTNAME>
ping: <GITLAB_HOSTNAME>: Name or service not known

On proxy server:

ping <GITLAB_HOSTNAME>
PING <GITLAB_HOSTNAME> (192.168.57.131) 56(84) bytes of data.
64 bytes from <GITLAB_HOSTNAME> (192.168.57.131): icmp_seq=1 ttl=59 time=3.08 ms
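The same comparison can be made from Python with `socket.getaddrinfo`, the call at the bottom of the traceback that raises `[Errno -2] Name or service not known`. A minimal sketch, where `resolves_locally` is a hypothetical helper name:

```python
import socket


def resolves_locally(hostname, port=443):
    """Return (True, addresses) if this machine resolves hostname, else (False, error)."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Same exception type as at the root of the DVC traceback.
        return False, str(exc)
    # Each entry ends with a sockaddr tuple whose first item is the address.
    return True, sorted({info[4][0] for info in infos})
```

Running `resolves_locally("<GITLAB_HOSTNAME>")` on the PC and on the proxy server should mirror the two `ping` results. A failure on the PC is expected in this setup and is harmless as long as the client hands the hostname to the proxy instead of resolving it locally.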

Change the domain name in the Git URL to the IP address:

dvc get <GITLAB_URL_WITH_IPADDRESS> <PATH> 
ERROR: failed to get '<PATH>' - SCM error: Failed to clone repo '<GITLAB_URL_WITH_IP>' to '/tmp/tmp3u5l98ondvc-clone': HTTPSConnectionPool(host='192.168.57.131', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by SSLError(SSLCertVerificationError(1, "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '192.168.57.131'. (_ssl.c:997)"))): HTTPSConnectionPool(host='192.168.57.131', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by SSLError(SSLCertVerificationError(1, "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '192.168.57.131'. (_ssl.c:997)"))): [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '192.168.57.131'. (_ssl.c:997)

It seems that I can connect to the GitLab server, but the certificate is not valid for the IP address.

shcheklein commented 1 month ago

Yes, it seems so, but it's hard to tell why it is trying to resolve the hostname on the machine outside the proxy.

Could you try adding the hostname to /etc/hosts as a workaround?

Otherwise we need a simple setup (some local proxy) to reproduce this.

luhuiguo commented 1 month ago

I still think there are some special cases where the proxy doesn't work.

I run DVC with Docker.

$ docker run -it --rm  -v ${PWD}:/workspace  <DVC_IMAGE> bash

Dockerfile

FROM ubuntu:24.04
RUN apt update && apt install -y gpg curl wget software-properties-common iputils-ping
RUN add-apt-repository -y ppa:git-core/ppa && apt update && apt install -y git
RUN git config --global user.email "<USER_EMAIL>" && git config --global user.name "<USER_NAME>"
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
RUN apt update && apt install -y git-lfs && git lfs install
RUN wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list && \
    wget -qO - https://dvc.org/deb/iterative.asc | gpg --dearmor > packages.iterative.gpg && \
    install -o root -g root -m 644 packages.iterative.gpg /etc/apt/trusted.gpg.d/ && \
    rm -f packages.iterative.gpg
RUN apt update && apt install -y dvc
RUN mkdir -p /workspace
WORKDIR /workspace

Everything works fine on other computers, but on this server, due to its network configuration, the GitLab server cannot be reached directly.

At first, the GitLab hostname is not resolvable and the GitLab host is unreachable:

$ ping <GITLAB_HOSTNAME>
ping: <GITLAB_HOSTNAME>: Name or service not known
$ ping 192.168.57.131
PING 192.168.57.131 (192.168.57.131) 56(84) bytes of data.

--- 192.168.57.131 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

$ curl <GITLAB_REPOSITORY_URL>
curl: (6) Could not resolve host: <GITLAB_HOSTNAME>

$ git clone <GIT_REPOSITORY_URL>
Cloning into '<DIRECTORY>'...
fatal: unable to access '<GIT_REPOSITORY_URL>': Could not resolve host: <GITLAB_HOSTNAME>

Configure Git to use a proxy

$ git config --global http.proxy http://10.3.12.8:3128
$ git config --global https.proxy http://10.3.12.8:3128

Git can clone the repository

$ git clone <GITLAB_REPOSITORY_URL>
Cloning into '<REPOSITORY>'...
remote: Enumerating objects: 166, done.
remote: Counting objects: 100% (133/133), done.
remote: Compressing objects: 100% (119/119), done.
remote: Total 166 (delta 42), reused 0 (delta 0), pack-reused 33
Receiving objects: 100% (166/166), 11.08 MiB | 871.00 KiB/s, done.
Resolving deltas: 100% (48/48), done.

Configure proxy environment variable

$ curl <GITLAB_REPOSITORY_URL>
curl: (6) Could not resolve host: <GITLAB_HOSTNAME>
$ export HTTP_PROXY=http://10.3.12.8:3128
$ export HTTPS_PROXY=http://10.3.12.8:3128

curl can now reach the server

$ curl <GITLAB_REPOSITORY_URL>
<html><body>You are being <a href="https://<GITLAB_HOSTNAME>/users/sign_in">redirected</a>.</body></html>

But DVC still cannot get the tracked file

$ dvc get <GITLAB_REPOSITORY_URL> <PATH>
ERROR: unexpected error - HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /JYAI/data-registry/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f85816334c0>: Failed to resolve '<GITLAB_HOSTNAME>' ([Errno -2] Name or service not known)")): HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f85816334c0>: Failed to resolve '<GITLAB_HOSTNAME>' ([Errno -2] Name or service not known)")): <urllib3.connection.HTTPSConnection object at 0x7f85816334c0>: Failed to resolve '<GITLAB_HOSTNAME>' ([Errno -2] Name or service not known): [Errno -2] Name or service not known

Add the hostname to /etc/hosts

$ echo "192.168.57.131 <GITLAB_HOSTNAME>">> /etc/hosts
$ ping <GITLAB_HOSTNAME>
PING <GITLAB_HOSTNAME> (192.168.57.131) 56(84) bytes of data.

Still not working

dvc get -v <GITLAB_REPOSITORY_URL> <PATH>
2024-09-25 12:52:05,038 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
2024-09-25 12:52:05,038 DEBUG: command: get -v <GITLAB_REPOSITORY_URL> <PATH>
2024-09-25 12:52:05,187 DEBUG: Creating external repo <GITLAB_REPOSITORY_URL>@None
2024-09-25 12:52:05,187 DEBUG: erepo: git clone '<GITLAB_REPOSITORY_URL>' to a temporary dir
Cloning data-registry.git|█████████████████████████████████████████████████████████████████████████████████████████| Compressing |119/119 [00:00,   3.01obj/s]
2024-09-25 13:00:47,786 ERROR: unexpected error - HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)')): HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)')): (<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)'): [Errno 110] Connection timed out
Traceback (most recent call last):
  File "urllib3/connection.py", line 199, in _new_conn
  File "urllib3/util/connection.py", line 85, in create_connection
  File "urllib3/util/connection.py", line 73, in create_connection
TimeoutError: [Errno 110] Connection timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "urllib3/connectionpool.py", line 789, in urlopen
  File "urllib3/connectionpool.py", line 490, in _make_request
  File "urllib3/connectionpool.py", line 466, in _make_request
  File "urllib3/connectionpool.py", line 1095, in _validate_conn
  File "urllib3/connection.py", line 693, in connect
  File "urllib3/connection.py", line 208, in _new_conn
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dulwich/client.py", line 2290, in _http_request
  File "urllib3/_request_methods.py", line 135, in request
  File "urllib3/_request_methods.py", line 182, in request_encode_url
  File "urllib3/poolmanager.py", line 443, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 843, in urlopen
  File "urllib3/util/retry.py", line 519, in increment
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/cli/__init__.py", line 211, in main
  File "dvc/cli/command.py", line 41, in do_run
  File "dvc/commands/get.py", line 30, in run
  File "dvc/commands/get.py", line 37, in _get_file_from_repo
  File "dvc/repo/get.py", line 45, in get
  File "dvc/repo/__init__.py", line 302, in open
  File "dvc/repo/open_repo.py", line 60, in open_repo
  File "contextlib.py", line 79, in inner
  File "dvc/repo/open_repo.py", line 23, in _external_repo
  File "dvc/repo/open_repo.py", line 134, in _cached_clone
  File "funcy/decorators.py", line 47, in wrapper
  File "funcy/flow.py", line 246, in wrap_with
  File "funcy/decorators.py", line 68, in __call__
  File "dvc/repo/open_repo.py", line 198, in _clone_default_branch
  File "dvc/scm.py", line 152, in clone
  File "dvc/repo/experiments/utils.py", line 275, in fetch_all_exps
  File "dvc/repo/experiments/utils.py", line 275, in <listcomp>
  File "dvc/repo/experiments/utils.py", line 119, in iter_remote_refs
  File "scmrepo/git/backend/dulwich/__init__.py", line 590, in iter_remote_refs
  File "dulwich/client.py", line 2208, in get_refs
  File "dulwich/client.py", line 2013, in _discover_references
  File "scmrepo/git/backend/dulwich/client.py", line 50, in _http_request
  File "dulwich/client.py", line 2298, in _http_request
dulwich.errors.GitProtocolError: HTTPSConnectionPool(host='<GITLAB_HOSTNAME>', port=443): Max retries exceeded with url: /<REPOSITORY>/info/refs?service=git-upload-pack (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fc791d8f5b0>, 'Connection to <GITLAB_HOSTNAME> timed out. (connect timeout=None)'))

2024-09-25 13:00:47,888 DEBUG: Version info for developers:
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.39
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.18.0),
        gdrive (pydrive2 = 1.20.0),
        gs (gcsfs = 2024.9.0.post1),
        hdfs (fsspec = 2024.9.0, pyarrow = 17.0.0),
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.9.0, boto3 = 1.35.23),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.9.0)
Config:
        Global: /root/.config/dvc
        System: /etc/xdg/dvc

shcheklein commented 1 month ago

I still think there are some special cases where the proxy doesn't work

Yes, right. Or it still picks it up, but for whatever reason tries to resolve the hostname locally, while that should happen on the proxy machine (?).

Still not working

🤔

I really need a way to reproduce this locally. Then I'm pretty sure I can find the reason faster. If you have an idea of how to run a proxy on my machine to experiment with, that would help a lot.

luhuiguo commented 1 month ago

Reproduce

Start a proxy server

$ docker run -d --name squid-container -e TZ=UTC -p 3128:3128 ubuntu/squid

Start a DVC container

$ docker run -it --rm luhuiguo/dvc bash

DVC get succeeds

root@bd4ec17f398c:/workspace# dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:25:27,589 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
2024-09-26 12:25:27,589 DEBUG: command: get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:25:27,697 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2024-09-26 12:25:27,697 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry' to a temporary dir
2024-09-26 12:25:49,445 DEBUG: Analytics is enabled.
2024-09-26 12:25:49,446 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpl2zikt7c', '-v']
2024-09-26 12:25:49,452 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpl2zikt7c', '-v'] with pid 234
2024-09-26 12:25:49,454 DEBUG: Removing '/tmp/tmpb241nh80dvc-clone'
2024-09-26 12:25:49,457 DEBUG: Removing '/tmp/tmpu6ugkwyrdvc-cache'

root@bd4ec17f398c:/workspace# rm -rf data

Block github.com hostname

root@bd4ec17f398c:/workspace# echo "127.0.0.1 github.com">> /etc/hosts
root@bd4ec17f398c:/workspace# cat /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3      bd4ec17f398c
127.0.0.1 github.com
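
To double-check that the override is in place before moving on, the hosts file can be scanned for the hostname. This is only a sketch under assumptions: `hosts_lookup` and `SAMPLE_HOSTS` are hypothetical names, and the sample text is a shortened copy of the output above rather than the real `/etc/hosts`.

```python
def hosts_lookup(hosts_text: str, hostname: str) -> list[str]:
    """Return every address that the given hosts-file text maps to hostname."""
    addresses = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        address, *names = line.split()
        if hostname in names:
            addresses.append(address)
    return addresses


# Shortened copy of the hosts file shown above (an assumption for the demo).
SAMPLE_HOSTS = """\
127.0.0.1       localhost
172.17.0.3      bd4ec17f398c
127.0.0.1 github.com
"""

if __name__ == "__main__":
    print(hosts_lookup(SAMPLE_HOSTS, "github.com"))
```

Inside the container, the same function could be fed the contents of `/etc/hosts` instead of the sample string.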

Git clone and DVC get fail

root@bd4ec17f398c:/workspace# git clone -v https://github.com/iterative/dataset-registry
Cloning into 'dataset-registry'...
fatal: unable to access 'https://github.com/iterative/dataset-registry/': Failed to connect to github.com port 443 after 0 ms: Couldn't connect to server
root@bd4ec17f398c:/workspace# dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
ERROR: failed to get 'get-started/data.xml' - SCM error: Failed to clone repo 'https://github.com/iterative/dataset-registry' to '/tmp/tmpiagkrm4pdvc-clone': HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23ea1fc400>: Failed to establish a new connection: [Errno 111] Connection refused')): HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f23ea1fc400>: Failed to establish a new connection: [Errno 111] Connection refused')): <urllib3.connection.HTTPSConnection object at 0x7f23ea1fc400>: Failed to establish a new connection: [Errno 111] Connection refused: [Errno 111] Connection refused

Configure Git to use a proxy

root@bd4ec17f398c:/workspace# git config --global http.proxy http://10.3.12.8:3128
root@bd4ec17f398c:/workspace# git config --global https.proxy http://10.3.12.8:3128
root@bd4ec17f398c:/workspace# git config list
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
filter.lfs.clean=git-lfs clean -- %f
user.email=luhuiguo@gmail.com
user.name=luhuiguo
filter.lfs.clean=git-lfs clean -- %f
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
http.proxy=http://10.3.12.8:3128
https.proxy=http://10.3.12.8:3128

Git clone succeeds

root@bd4ec17f398c:/workspace# git clone -v https://github.com/iterative/dataset-registry
Cloning into 'dataset-registry'...
POST git-upload-pack (175 bytes)
POST git-upload-pack (gzip 1202 to 636 bytes)
remote: Enumerating objects: 328, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 328 (delta 53), reused 61 (delta 38), pack-reused 205 (from 1)
Receiving objects: 100% (328/328), 50.37 KiB | 606.00 KiB/s, done.
Resolving deltas: 100% (85/85), done.

DVC get fails

root@bd4ec17f398c:/workspace# dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:39:41,732 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
2024-09-26 12:39:41,732 DEBUG: command: get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:39:41,837 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2024-09-26 12:39:41,837 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry' to a temporary dir
2024-09-26 12:39:43,436 ERROR: unexpected error - HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused')): HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused')): <urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused: [Errno 111] Connection refused
Traceback (most recent call last):
  File "urllib3/connection.py", line 199, in _new_conn
  File "urllib3/util/connection.py", line 85, in create_connection
  File "urllib3/util/connection.py", line 73, in create_connection
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "urllib3/connectionpool.py", line 789, in urlopen
  File "urllib3/connectionpool.py", line 490, in _make_request
  File "urllib3/connectionpool.py", line 466, in _make_request
  File "urllib3/connectionpool.py", line 1095, in _validate_conn
  File "urllib3/connection.py", line 693, in connect
  File "urllib3/connection.py", line 214, in _new_conn
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dulwich/client.py", line 2290, in _http_request
  File "urllib3/_request_methods.py", line 135, in request
  File "urllib3/_request_methods.py", line 182, in request_encode_url
  File "urllib3/poolmanager.py", line 443, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 873, in urlopen
  File "urllib3/connectionpool.py", line 843, in urlopen
  File "urllib3/util/retry.py", line 519, in increment
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/cli/__init__.py", line 211, in main
  File "dvc/cli/command.py", line 41, in do_run
  File "dvc/commands/get.py", line 30, in run
  File "dvc/commands/get.py", line 37, in _get_file_from_repo
  File "dvc/repo/get.py", line 45, in get
  File "dvc/repo/__init__.py", line 302, in open
  File "dvc/repo/open_repo.py", line 60, in open_repo
  File "contextlib.py", line 79, in inner
  File "dvc/repo/open_repo.py", line 23, in _external_repo
  File "dvc/repo/open_repo.py", line 134, in _cached_clone
  File "funcy/decorators.py", line 47, in wrapper
  File "funcy/flow.py", line 246, in wrap_with
  File "funcy/decorators.py", line 68, in __call__
  File "dvc/repo/open_repo.py", line 198, in _clone_default_branch
  File "dvc/scm.py", line 152, in clone
  File "dvc/repo/experiments/utils.py", line 275, in fetch_all_exps
  File "dvc/repo/experiments/utils.py", line 275, in <listcomp>
  File "dvc/repo/experiments/utils.py", line 119, in iter_remote_refs
  File "scmrepo/git/backend/dulwich/__init__.py", line 590, in iter_remote_refs
  File "dulwich/client.py", line 2208, in get_refs
  File "dulwich/client.py", line 2013, in _discover_references
  File "scmrepo/git/backend/dulwich/client.py", line 50, in _http_request
  File "dulwich/client.py", line 2298, in _http_request
dulwich.errors.GitProtocolError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /iterative/dataset-registry/info/refs?service=git-upload-pack (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f86763ab4f0>: Failed to establish a new connection: [Errno 111] Connection refused'))

2024-09-26 12:39:43,464 DEBUG: Version info for developers:
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Subprojects:

Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.18.0),
        gdrive (pydrive2 = 1.20.0),
        gs (gcsfs = 2024.9.0.post1),
        hdfs (fsspec = 2024.9.0, pyarrow = 17.0.0),
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.9.0, boto3 = 1.35.23),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.9.0)
Config:
        Global: /root/.config/dvc
        System: /etc/xdg/dvc

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-09-26 12:39:43,469 DEBUG: Analytics is enabled.
2024-09-26 12:39:43,470 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpk4n384y7', '-v']
2024-09-26 12:39:43,475 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpk4n384y7', '-v'] with pid 342

Unblock github.com

root@bd4ec17f398c:/workspace# echo "$(sed '/github.com/d' /etc/hosts)" > /etc/hosts
root@bd4ec17f398c:/workspace# cat /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3      bd4ec17f398c
root@bd4ec17f398c:/workspace# git config list
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
filter.lfs.clean=git-lfs clean -- %f
user.email=luhuiguo@gmail.com
user.name=luhuiguo
filter.lfs.clean=git-lfs clean -- %f
filter.lfs.smudge=git-lfs smudge -- %f
filter.lfs.process=git-lfs filter-process
filter.lfs.required=true
http.proxy=http://10.3.12.8:3128
https.proxy=http://10.3.12.8:3128

DVC get succeeds

root@bd4ec17f398c:/workspace# dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:47:29,915 DEBUG: v3.55.2 (deb), CPython 3.10.8 on Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
2024-09-26 12:47:29,915 DEBUG: command: get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-26 12:47:30,025 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2024-09-26 12:47:30,025 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry' to a temporary dir
2024-09-26 12:50:50,305 DEBUG: Analytics is enabled.
2024-09-26 12:50:50,305 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp29_f6y8i', '-v']
2024-09-26 12:50:50,311 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp29_f6y8i', '-v'] with pid 516
2024-09-26 12:50:50,312 DEBUG: Removing '/tmp/tmp_0xv8ud8dvc-clone'
2024-09-26 12:50:50,314 DEBUG: Removing '/tmp/tmp47ujcmg0dvc-cache'
shcheklein commented 1 month ago

@luhuiguo could you try installing scmrepo from this branch https://github.com/iterative/scmrepo/pull/378 and run some experiments?

Thanks for the reproducible env!

luhuiguo commented 1 month ago

It works

Install scmrepo from branch fix-fetch-exps-under-proxy

$ docker run -it --rm python bash
root@fcf01db756c8:/# pip install dvc
root@fcf01db756c8:/# pip install git+https://github.com/iterative/scmrepo.git@fix-fetch-exps-under-proxy
Collecting git+https://github.com/iterative/scmrepo.git@fix-fetch-exps-under-proxy
  Cloning https://github.com/iterative/scmrepo.git (to revision fix-fetch-exps-under-proxy) to /tmp/pip-req-build-v2fbye8o
.......
Successfully built scmrepo
Installing collected packages: scmrepo
  Attempting uninstall: scmrepo
    Found existing installation: scmrepo 3.3.7
    Uninstalling scmrepo-3.3.7:
      Successfully uninstalled scmrepo-3.3.7
Successfully installed scmrepo-3.3.8.dev4+gf2e18e2

Block github.com and use git proxy config

root@fcf01db756c8:/# echo "127.0.0.1 github.com">> /etc/hosts
root@fcf01db756c8:/# git config --global http.proxy http://10.3.12.8:3128
root@fcf01db756c8:/# git config --global https.proxy http://10.3.12.8:3128
root@fcf01db756c8:/# dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-27 01:48:59,097 DEBUG: v3.55.2 (pip), CPython 3.12.6 on Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.36
2024-09-27 01:48:59,097 DEBUG: command: /usr/local/bin/dvc get -v https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
2024-09-27 01:48:59,285 DEBUG: Creating external repo https://github.com/iterative/dataset-registry@None
2024-09-27 01:48:59,285 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry' to a temporary dir
2024-09-27 01:49:09,292 DEBUG: Analytics is enabled.
2024-09-27 01:49:09,323 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp5dtejb1u', '-v']
2024-09-27 01:49:09,328 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp5dtejb1u', '-v'] with pid 219
2024-09-27 01:49:09,330 DEBUG: Removing '/tmp/tmpbjw1dcxjdvc-clone'
2024-09-27 01:49:09,333 DEBUG: Removing '/tmp/tmp6mr_z8rfdvc-cache'
shcheklein commented 1 month ago

Okay, good. I'll try to add tests and release it ASAP. Thanks for your help reproducing this.