iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Pull: Doesn't work with GET redirects (JFrog Artifactory) #10454

Closed seakros closed 3 months ago

seakros commented 3 months ago

Bug Report

Description

DVC fails to pull files of 200KB or larger when using JFrog Artifactory as the remote storage backend. The issue seems to be related to how DVC handles redirects from JFrog when downloading larger files.

Reproduce

  1. Set up JFrog Artifactory as the DVC remote storage backend with the following configuration in .dvc/config:
    [core]
    remote = artifactory
    autostage = true
    ['remote "artifactory"']
    url = https://jfrog.io/artifactory/dvc_store
    method = PUT
    auth = basic
    user = <username>
    password = <authentication_token>
    custom_auth_header = X-JFrog-Art-Api:<API_KEY>
  2. Create two files using dd, one with a size of 199KB and another with a size of 200KB.
  3. Run dvc add and dvc push to add and push the files to JFrog Artifactory.
  4. Commit the changes and push to the git repository.
  5. Clone the repository in a new location.
  6. Run dvc pull in the newly cloned repository.
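
For anyone reproducing step 2 without dd, here is a minimal sketch that writes the two test files. It assumes "KB" means 1024 bytes; the file names and helper functions are made up for illustration and are not part of the original report.

```python
import os

KIB = 1024  # assumption: "KB" in the report means 1024 bytes


def make_test_file(path, size_kib):
    """Write size_kib KiB of random bytes to path and return the byte count."""
    data = os.urandom(size_kib * KIB)
    with open(path, "wb") as fh:
        fh.write(data)
    return len(data)


def make_threshold_files(directory):
    """Create one file just below and one at the reported 200KB threshold."""
    sizes = {"small_199k.bin": 199, "large_200k.bin": 200}  # hypothetical names
    return {
        name: make_test_file(os.path.join(directory, name), kib)
        for name, kib in sizes.items()
    }
```

Calling make_threshold_files(".") in the workspace leaves two files that can then be tracked with dvc add and pushed as in step 3.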

Expected

All files, regardless of their size, should be successfully pulled from JFrog Artifactory when running dvc pull.

What Happens Instead

When running dvc pull in the newly cloned repository, only the file with a size of 199KB (empirically tested threshold size) is successfully downloaded from JFrog Artifactory. The file with a size of 200KB is not pulled, resulting in an incomplete dataset.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.42.0 (pip)
-------------------------
Platform: Python 3.8.19 on macOS-10.16-x86_64-i386-64bit
Subprojects:
    dvc_data = 3.8.0
    dvc_objects = 3.0.6
    dvc_render = 1.0.1
    dvc_task = 0.4.0
    scmrepo = 2.0.4
Supports:
    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)
Config:
    Global: /Users/<User>/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/52fdc8f5f325b94f5c48af1774bcaf5e

Additional Information (if any):

JFrog's documentation suggests using the following curl command to download a file:

curl -u <USER>:<PASSWORD> -L -O "https://jfrog.io/artifactory/dvc_store/<TARGET_FILE_PATH>"

The -L flag is used to handle redirects. When testing curl without the -L flag, only files smaller than 200KB can be downloaded successfully, while larger files result in an empty file.

From curl's man page:

-L, --location
              (HTTP) If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code), this option makes curl redo the request on the new place. If used together with
              -i, --include or -I, --head, headers from all requested pages are shown.

It appears that DVC is not properly handling the redirects from JFrog when attempting to download files larger than 200KB, leading to the issue where only files smaller than 200KB are successfully pulled.
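
The curl behaviour described above (empty result without -L, full download with it) can be illustrated with a small self-contained sketch. It uses only the Python standard library and a throwaway local server; the /artifact and /blob paths and all class and function names are hypothetical. It says nothing about how DVC or aiohttp behave and only shows what a client sees when it does or does not follow a 302.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"x" * 1024  # stand-in for a large object held in backing storage


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical server that redirects the artifact URL to a blob URL."""

    def do_GET(self):
        if self.path == "/artifact":
            # A 302 with no body, only a Location header.
            self.send_response(302)
            self.send_header("Location", "/blob")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/blob":
            self.send_response(200)
            self.send_header("Content-Length", str(len(PAYLOAD)))
            self.end_headers()
            self.wfile.write(PAYLOAD)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects, comparable to curl without -L."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch(url, follow_redirects):
    """Return (status, body) for a GET, optionally following redirects."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any proxy settings
    if not follow_redirects:
        handlers.append(NoRedirect())
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # With redirects disabled, urllib surfaces the 302 as an HTTPError.
        return err.code, err.read()


def run_demo():
    """Fetch the same URL without and with redirect following."""
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/artifact"
    try:
        return fetch(url, False), fetch(url, True)
    finally:
        server.shutdown()
        server.server_close()
```

run_demo() returns one (status, body) pair per mode, so the two results can be compared directly: the first carries the 302 and an empty body, the second the payload served from the redirect target.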

shcheklein commented 3 months ago

Could you run curl in a verbose mode and see what kind of redirect it is?

In general, the DVC HTTP remote driver does handle redirects, e.g. our example projects use remotes like:

https://remote.dvc.org/get-started

which is a redirect to S3.

I wonder what is so specific about that redirect?

Also, could you try to dvc pull in the example-get-started with the same DVC version, same packages installed, etc?

seakros commented 3 months ago

Yes, I went through some of the source of dvc-data, dvc-objects and dvc-http, and it would seem the libraries you use should deal with redirects automatically, so I'm not sure what is going on here. JFrog also seems to redirect to an Amazon S3 bucket for the larger files (the ones that fail to get pulled).

Curl Output

Below is the verbose output from curl:

* IPv6: (none)
* IPv4: 3.228.154.22, 18.205.85.132, 54.81.195.252
*   Trying 3.228.154.22:443...
* Connected to jfrog.io (3.228.154.22) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
} [319 bytes data]
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* (304) (IN), TLS handshake, Server hello (2):
{ [108 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [3701 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [300 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=*.jfrog.io
*  start date: Jan 17 00:00:00 2024 GMT
*  expire date: Feb 16 23:59:59 2025 GMT
*  subjectAltName: host "*.jfrog.io" matched cert's "*.jfrog.io"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust TLS RSA CA G1
*  SSL certificate verify ok.
* using HTTP/1.x
* Server auth using Basic with user '<USER>'
> GET /artifactory/dvc/dvc_store/files/md5/be/9dc94aa32d037418803f90e719b84f HTTP/1.1
> Host: *.jfrog.io
> Authorization: Basic mNvbTpjbV[...]
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 302
< Date: Sat, 08 Jun 2024 20:58:30 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-JFrog-Version: Artifactory/7.88.0 78800900
< X-Artifactory-Id: <ID>
< X-Artifactory-Node-Id: *-artifactory-primary-1
< Location: https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com/aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=<DATE>&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=<CREDENTIAL>&X-Amz-Signature=<SIGNATURE>
< X-Request-ID: ba6b6e246993c634d3293e4c1a69eedc
<
* Ignoring the response-body
* Leftovers after chunking: 5 bytes
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connection #0 to host*.jfrog.io left intact
* Issue another request to this URL: 'https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com/aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>
* Host jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 52.216.206.27, 52.217.194.97, 54.231.171.185, 52.217.175.89, 52.217.202.97, 16.182.71.41, 52.216.209.137, 52.216.32.25
*   Trying 52.216.206.27:443...
* Connected to jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com (52.216.206.27) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
} [358 bytes data]
* (304) (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* (304) (IN), TLS handshake, Unknown (8):
{ [25 bytes data]
* (304) (IN), TLS handshake, Certificate (11):
{ [4980 bytes data]
* (304) (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* (304) (IN), TLS handshake, Finished (20):
{ [36 bytes data]
* (304) (OUT), TLS handshake, Finished (20):
} [36 bytes data]
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=*.s3.amazonaws.com
*  start date: Apr 22 00:00:00 2024 GMT
*  expire date: Apr  7 23:59:59 2025 GMT
*  subjectAltName: host "jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
* using HTTP/1.x
> GET /aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=<DATE>&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=<CREDENTIAL>&X-Amz-Signature=<SIGNATURE>
> Host: jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
< x-amz-id-2: Z26x6e92GKJ[...]
< x-amz-request-id: F8CX[...]
< Date: Sat, 08 Jun 2024 20:58:32 GMT
< x-amz-replication-status: COMPLETED
< Last-Modified: Fri, 07 Jun 2024 20:23:08 GMT
< ETag: "3741be[...]"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: 8Yz42QO[...]
< Content-Disposition: attachment;filename="9dc94aa32d037418803f90e719b84f"
< Accept-Ranges: bytes
< Content-Type: application/octet-stream
< Server: AmazonS3
< Content-Length: 51825240
<
{ [1360 bytes data]
100 49.4M  100 49.4M    0     0  2444k      0  0:00:20  0:00:20 --:--:-- 6356k
* Connection #1 to host jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com left intact
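
In the trace above, the second request goes to an S3 host without an Authorization header, and the redirect URL itself carries the signing parameters. A minimal sketch for pulling the interesting fields out of such a Location value follows. The URL is a shortened copy of the one in the trace with the redacted parameters omitted, the function name is hypothetical, and reading X-Amz-Expires as a validity window in seconds is an assumption based on how S3 pre-signed URLs normally work.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened form of the Location header from the trace; redacted parts omitted.
LOCATION = (
    "https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com/"
    "aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3"
    "?X-Artifactory-repositoryKey=dvc"
    "&X-Artifactory-artifactPath="
    "dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f"
    "&X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Expires=60"
)


def summarize_redirect(location):
    """Extract the target host and a few query parameters from a redirect URL."""
    parts = urlsplit(location)
    # parse_qs percent-decodes values and returns lists; keep the first of each.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.netloc,
        "artifact_path": query.get("X-Artifactory-artifactPath"),
        "signed_headers": query.get("X-Amz-SignedHeaders"),
        # Assumption: validity of the signed URL in seconds.
        "expires_seconds": int(query["X-Amz-Expires"]),
    }
```

summarize_redirect(LOCATION) makes it easy to confirm that the redirect points at the same files/md5/... object that was requested from Artifactory.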

Pull from example-get-started

Tested out the pull from example-get-started as requested, and everything seems to work as expected here, i.e.

Applying changes                                                                                                                    |14.0 [00:00, 1.77kfile/s]
A       eval/
A       model.pkl
A       data/prepared/
A       data/data.xml
A       data/features/
5 files added and 17 files fetched

Let me know if there's anything else I could help supply to track down the issue.

seakros commented 3 months ago

I have resolved my problem. I had been playing around with the config, and after a series of dvc destroy runs and the resulting inconsistent tracking of .dvc files, my repo was left in a dirty state, with some files hashed in my local cache and others not. I was only ever pushing a partial local cache.
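
A partial local cache like the one described can be spotted with a rough check along these lines. This is a sketch under stated assumptions, not DVC's own mechanism: it assumes .dvc files list outputs as "md5: <hash>" lines, and that the local cache mirrors the files/md5/<first two characters>/<rest> layout visible in the remote path in the curl trace. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: outputs appear in .dvc files as "md5: <32 hex chars>" lines,
# with an optional ".dir" suffix for directories.
MD5_LINE = re.compile(
    r"^\s*(?:-\s*)?md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE
)


def tracked_hashes(repo_root):
    """Collect md5 values from every *.dvc file under repo_root."""
    hashes = set()
    for dvc_file in Path(repo_root).rglob("*.dvc"):
        if dvc_file.is_file():  # skips the .dvc directory itself
            hashes.update(MD5_LINE.findall(dvc_file.read_text()))
    return hashes


def missing_from_cache(repo_root, cache_dir=None):
    """Return tracked hashes that have no matching object in the local cache."""
    # Assumption: default cache location and layout under the repository root.
    default = Path(repo_root) / ".dvc" / "cache" / "files" / "md5"
    cache = Path(cache_dir) if cache_dir else default
    return sorted(
        h for h in tracked_hashes(repo_root)
        if not (cache / h[:2] / h[2:]).is_file()
    )
```

A non-empty result from missing_from_cache(".") would mean that some tracked files cannot be pushed from this clone, which matches the symptom of only part of the dataset arriving on pull.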

Closing the issue.