Closed by seakros 3 months ago
Could you run curl in a verbose mode and see what kind of redirect it is?
In general, the DVC HTTP remote driver does handle redirects; for example, our example projects use remotes like:
https://remote.dvc.org/get-started
which is a redirect to S3.
I wonder what is so specific about that redirect?
Also, could you try to run dvc pull in the example-get-started project with the same DVC version, same packages installed, etc.?
Yes, I tried to go through a bit of the source of dvc-data, dvc-objects, and dvc-http, and it would seem the libraries you use should deal with redirects automatically. So I'm not sure what is going on here. With JFrog, it also seems to redirect to an Amazon S3 bucket for the larger files (the ones that fail to get pulled).
Below is the verbose output from curl:
* IPv6: (none)
* IPv4: 3.228.154.22, 18.205.85.132, 54.81.195.252
* Trying 3.228.154.22:443...
* Connected to jfrog.io (3.228.154.22) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
} [319 bytes data]
* CAfile: /etc/ssl/cert.pem
* CApath: none
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* (304) (IN), TLS handshake, Server hello (2):
{ [108 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [3701 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [300 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
* subject: CN=*.jfrog.io
* start date: Jan 17 00:00:00 2024 GMT
* expire date: Feb 16 23:59:59 2025 GMT
* subjectAltName: host "*.jfrog.io" matched cert's "*.jfrog.io"
* issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=GeoTrust TLS RSA CA G1
* SSL certificate verify ok.
* using HTTP/1.x
* Server auth using Basic with user '<USER>'
> GET /artifactory/dvc/dvc_store/files/md5/be/9dc94aa32d037418803f90e719b84f HTTP/1.1
> Host: *.jfrog.io
> Authorization: Basic mNvbTpjbV[...]
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 302
< Date: Sat, 08 Jun 2024 20:58:30 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-JFrog-Version: Artifactory/7.88.0 78800900
< X-Artifactory-Id: <ID>
< X-Artifactory-Node-Id: *-artifactory-primary-1
< Location: https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com/aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=<DATE>&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=<CREDENTIAL>&X-Amz-Signature=<SIGNATURE>
< X-Request-ID: ba6b6e246993c634d3293e4c1a69eedc
<
* Ignoring the response-body
* Leftovers after chunking: 5 bytes
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Connection #0 to host*.jfrog.io left intact
* Issue another request to this URL: 'https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com/aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>
* Host jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 52.216.206.27, 52.217.194.97, 54.231.171.185, 52.217.175.89, 52.217.202.97, 16.182.71.41, 52.216.209.137, 52.216.32.25
* Trying 52.216.206.27:443...
* Connected to jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com (52.216.206.27) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
} [358 bytes data]
* (304) (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* (304) (IN), TLS handshake, Unknown (8):
{ [25 bytes data]
* (304) (IN), TLS handshake, Certificate (11):
{ [4980 bytes data]
* (304) (IN), TLS handshake, CERT verify (15):
{ [264 bytes data]
* (304) (IN), TLS handshake, Finished (20):
{ [36 bytes data]
* (304) (OUT), TLS handshake, Finished (20):
} [36 bytes data]
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted http/1.1
* Server certificate:
* subject: CN=*.s3.amazonaws.com
* start date: Apr 22 00:00:00 2024 GMT
* expire date: Apr 7 23:59:59 2025 GMT
* subjectAltName: host "jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com" matched cert's "*.s3.amazonaws.com"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
* SSL certificate verify ok.
* using HTTP/1.x
> GET /aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3?X-Artifactory-username=<USER>&X-Artifactory-repoType=local&X-Artifactory-repositoryKey=dvc&X-Artifactory-originPackageType=generic&X-Artifactory-packageType=generic&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f&X-Artifactory-originProjectKey=all&X-Artifactory-projectKey=all&X-Artifactory-originRepoType=local&X-Artifactory-originRepositoryKey=dvc&x-jf-traceId=9f39c937fca[...]&response-content-disposition=attachment%3Bfilename%3D%229dc9f90e719b84f%22&response-content-type=application%2Foctet-stream&X-Amz-Security-Token=<TOKEN>&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=<DATE>&X-Amz-SignedHeaders=host&X-Amz-Expires=60&X-Amz-Credential=<CREDENTIAL>&X-Amz-Signature=<SIGNATURE>
> Host: jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/1.1 200 OK
< x-amz-id-2: Z26x6e92GKJ[...]
< x-amz-request-id: F8CX[...]
< Date: Sat, 08 Jun 2024 20:58:32 GMT
< x-amz-replication-status: COMPLETED
< Last-Modified: Fri, 07 Jun 2024 20:23:08 GMT
< ETag: "3741be[...]"
< x-amz-server-side-encryption: AES256
< x-amz-version-id: 8Yz42QO[...]
< Content-Disposition: attachment;filename="9dc94aa32d037418803f90e719b84f"
< Accept-Ranges: bytes
< Content-Type: application/octet-stream
< Server: AmazonS3
< Content-Length: 51825240
<
{ [1360 bytes data]
100 49.4M 100 49.4M 0 0 2444k 0 0:00:20 0:00:20 --:--:-- 6356k
* Connection #1 to host jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com left intact
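For anyone reading the trace above, here is a small sketch of how the Location header can be picked apart. It uses only Python's standard library, and the URL is a shortened copy of the one in the trace with the redacted values left out, so the parameter subset is my selection rather than the full header.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the Location header from the trace above.
# Redacted parameters (<USER>, <TOKEN>, <SIGNATURE>, ...) are omitted on purpose.
location = (
    "https://jfrog-prod-use1-shared-virginia-main.s3.amazonaws.com"
    "/aol-compX/filestore/24/24d3af8d03d58407631c68094dd1969b09b679b3"
    "?X-Artifactory-repositoryKey=dvc"
    "&X-Artifactory-artifactPath=dvc_store%2Ffiles%2Fmd5%2Fbe%2F9dc94aa32d037418803f90e719b84f"
    "&X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Expires=60"
)

parts = urlsplit(location)
params = parse_qs(parts.query)  # parse_qs also percent-decodes the values

redirect_host = parts.netloc
artifact_path = params["X-Artifactory-artifactPath"][0]
expires_seconds = int(params["X-Amz-Expires"][0])
signed_headers = params["X-Amz-SignedHeaders"][0]

print("redirect target:", redirect_host)
print("artifact path:  ", artifact_path)
print("valid for (s):  ", expires_seconds)
print("signed headers: ", signed_headers)
```

Two things stand out in the trace itself: the redirect URL carries X-Amz-Expires=60, and curl's second request to the S3 host is sent without the Authorization header that went to JFrog.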
example-get-started
Tested out the pull from example-get-started as requested, and everything seems to work as expected here, i.e.:
Applying changes |14.0 [00:00, 1.77kfile/s]
A eval/
A model.pkl
A data/prepared/
A data/data.xml
A data/features/
5 files added and 17 files fetched
Let me know if there's anything else I could help supply to track down the issue.
I have resolved my problem. I had been playing around with the config, and after a series of dvc destroy runs and subsequent inconsistent tracking of .dvc files, my repo was in a dirty state, with some files hashed in my local cache and others not. I was only ever pushing a partial local cache.
Closing the issue.
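In case someone else lands here with a partially populated cache, below is a rough sketch of how one might check which hashes are missing from a cache directory. It is not a DVC API: missing_from_cache is a hypothetical helper, and it assumes the local cache uses the same files/md5/<first two characters>/<rest> layout that is visible in the remote path from the curl trace. The demo runs against a temporary directory rather than a real repo.

```python
import tempfile
from pathlib import Path


def missing_from_cache(cache_dir, md5_hashes):
    """Return the hashes that have no object file under cache_dir.

    Hypothetical helper. Assumes the layout
    files/md5/<first two hex chars>/<remaining chars>.
    """
    root = Path(cache_dir) / "files" / "md5"
    return [h for h in md5_hashes if not (root / h[:2] / h[2:]).is_file()]


# Demo against a throwaway directory standing in for a cache.
present = "be9dc94aa32d037418803f90e719b84f"  # hash taken from the curl trace
absent = "0" * 32                             # made-up hash for illustration

with tempfile.TemporaryDirectory() as tmp:
    obj = Path(tmp) / "files" / "md5" / present[:2] / present[2:]
    obj.parent.mkdir(parents=True)
    obj.write_bytes(b"demo")
    missing = missing_from_cache(tmp, [present, absent])

print(missing)
```

The list of hashes would have to come from the md5 fields in the .dvc files; reading those is left out here to keep the sketch free of third-party dependencies.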
Bug Report
Description
DVC fails to pull files greater than 200KB in size when using JFrog Artifactory as the remote storage backend. The issue seems to be related to how DVC handles redirects from JFrog when attempting to download larger files.
Reproduce
Create two files with dd, one with a size of 199KB and another with a size of 200KB.
Use dvc add and dvc push to add and push the files to JFrog Artifactory.
Run dvc pull in the newly cloned repository.
Expected
All files, regardless of their size, should be successfully pulled from JFrog Artifactory when running dvc pull.
What Happens Instead
When running dvc pull in the newly cloned repository, only the file with a size of 199KB (empirically tested threshold size) is successfully downloaded from JFrog Artifactory. The file with a size of 200KB is not pulled, resulting in an incomplete dataset.
Environment information
Output of dvc doctor:
Additional Information (if any):
JFrog's documentation suggests using the following curl command to download a file:
The -L flag is used to handle redirects. When testing curl without the -L flag, only files smaller than 200KB can be downloaded successfully, while larger files result in an empty file.
From curl's man page:
It appears that DVC is not properly handling the redirects from JFrog when attempting to download files larger than 200KB, leading to the issue where only files smaller than 200KB are successfully pulled.
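As a rough illustration of the difference the -L flag makes, here is a self-contained sketch that uses only Python's standard library. It starts a throwaway local server that answers /artifact with a 302 and an empty body, then fetches it once without following the redirect and once with. The server, the paths /artifact and /blob, and the payload are all made up for the demo; it is not JFrog, and it says nothing about what DVC's own HTTP client does.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"x" * 1024  # stand-in for the real object


class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/artifact":
            # A 302 with an empty body, pointing at the real location.
            self.send_response(302)
            self.send_header("Location", "/blob")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/blob":
            self.send_response(200)
            self.send_header("Content-Length", str(len(PAYLOAD)))
            self.end_headers()
            self.wfile.write(PAYLOAD)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)  # port 0: pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1) No redirect following (comparable to curl without -L): http.client
#    returns the 302 response as-is.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/artifact")
resp = conn.getresponse()
status_no_follow = resp.status
body_no_follow = resp.read()
conn.close()

# 2) Redirect following (comparable to curl -L): urllib's default opener
#    follows the 302. The empty ProxyHandler avoids any system proxy settings.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{port}/artifact") as r:
    status_follow = r.status
    body_follow = r.read()

server.shutdown()
server.server_close()

print(status_no_follow, len(body_no_follow))
print(status_follow, len(body_follow))
```

The first request ends up with the 302 and zero bytes, which matches the empty file seen with curl without -L; the second ends up with the payload.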