Open jbogar opened 1 year ago
@jbogar Are you using real s3 or some s3-compatible storage?
Real s3
@jbogar Do you use hardlink/symlink link types? Please show your dvc config --list
Output of dvc config --list (redacted with <...>):
remote.<remote>.url=s3://<bucket>/<remote>/
core.autostage=true
core.remote=<remote>
remote.<remote>.url=s3://<bucket>/<remote>
remote.<remote>.profile=<profile>
core.remote=<remote>
@jbogar Are you able to reliably reproduce this? Does this only happen with import git@github.com/some_repository, but not with clone git@github.com/some_repository & pull?
Also, could you try modifying the config in some_repository to set verify=true for the default remote? That will make dvc not trust the remote and verify file hashes locally.
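For reference, a minimal way to set that (a sketch assuming the remote is named <remote> as in the config above):

```
# Turn on local hash verification for downloads from this remote;
# this writes verify = true under the remote's section in .dvc/config.
dvc remote modify <remote> verify true
```

With verify enabled, dvc re-hashes files after download, so corrupted transfers should surface as errors instead of landing silently in the workspace.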
I also have this exact problem
@johan-sightic Please post more info, e.g. dvc doctor output.
@efiop I just broke my entire dvc setup trying to fix another problem but I will come back if I solve it
@efiop I tried to reproduce the issue. Here are the steps (without verify=true):
1. dvc update
2. Ctrl+C
3. dvc update again

It finished but gave this error:
ERROR: failed to transfer 'eb7f1a9933c0d5b59acbc89169f60d5c' - The difference between the request time and the current time is too large.: An error occurred (RequestTimeTooSkewed) when calling the GetObject operation: The difference between the request time and the current time is too large.
And three out of 40 files had malformed content (not empty, just malformed). So interrupted updates, or this timing issue, may be one cause.
EDIT: I tried again, this time without interrupting. The same timing error appeared for multiple files again, but this time the data is correct. So in the previous case it might have been the interrupt.
EDIT2: I reproduced it again, with no timing errors this time. The steps:
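About the RequestTimeTooSkewed error above: S3 rejects requests when the client's clock differs from AWS time by more than roughly 15 minutes, so the machine's clock is worth checking. A quick comparison (a sketch; any S3 endpoint's Date header works as a reference):

```
# Local clock in UTC vs. the time S3 itself reports
date -u
curl -sI https://s3.amazonaws.com | grep -i '^date'
```

If the two disagree badly, fixing NTP synchronization should remove the timing errors; whether clock skew can also explain the corrupted files is a separate question.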
@jbogar That's really strange. We try to download to temporary files and then just move them into place to be semi-atomic and avoid any corruption. Maybe there is a regression with the recent transfer changes, will take a look.
@pmrowla FYI ^
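For illustration, the pattern described above is roughly this (a sketch, not DVC's actual code; bucket, key and dest are placeholders):

```
# Download into a temp file next to the destination, then rename it
# into place; rename is atomic on the same filesystem, so a reader
# never sees a half-written file -- at worst a temp file is left behind.
tmp=$(mktemp "$dest.XXXXXX")
aws s3 cp "s3://bucket/key" "$tmp" && mv "$tmp" "$dest"
```

With this pattern, an interrupted transfer should leave the destination untouched, which is why malformed files in the workspace suggest a regression rather than expected behavior.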
@jbogar Btw, could you try downgrading s3fs to 2022.11.0 and see if that makes it work again? I'm not able to reproduce so far.
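For a pip-based install, pinning the older version would look like this (adjust for conda or other package managers):

```
pip install "s3fs==2022.11.0"
pip show s3fs   # confirm the installed version
```

pip may also downgrade aiobotocore/botocore to versions compatible with that s3fs release.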
Bug Report
import: malformed files
Description
In some_repository, I added each file in some_directory to dvc (so each file in the directory has its own .dvc file). Then I ran dvc import git@github.com/some_repository some_directory -o another_directory, so that there is a single .dvc file for the whole directory.

However, some of the files were malformed (.jsonl files, some lines were cut off in the middle). After deleting the files from cache and running dvc update, the files would be downloaded correctly, but other files, at random, became malformed.
I manually checked the hashes of the files. The hash of the malformed files was different from the file hash in .dvc/cache, so it's likely they were malformed during download.
A colleague observed the same behavior (with different files) with dvc==2.45.0; my dvc version is 2.43.2.
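For concreteness, the setup described above is roughly the following (using the placeholder names from this report):

```
# In some_repository: track each file individually
dvc add some_directory/*
git add some_directory/*.dvc some_directory/.gitignore
git commit -m "track files in some_directory"

# In the consuming repository: import the whole directory as one output
dvc import git@github.com/some_repository some_directory -o another_directory
```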
Expected
That files are not malformed, or that an error is raised if they are. A command that recalculates hashes and checks whether files were downloaded correctly would also be useful.
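Until such a command exists, a rough stand-in could look like this (a sketch assuming the one-.dvc-file-per-file layout described above and the md5 field DVC writes into .dvc files; not an official tool):

```
# Compare each tracked file's actual md5 against the one recorded
# in its .dvc file, and report any mismatch.
for dvcfile in some_directory/*.dvc; do
  f="${dvcfile%.dvc}"
  expected=$(awk '/md5:/ {print $NF; exit}' "$dvcfile")
  actual=$(md5sum "$f" | awk '{print $1}')
  [ "$expected" = "$actual" ] || echo "MISMATCH: $f ($actual != $expected)"
done
```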
Output of dvc doctor: