Open SebastianRiechert2 opened 3 months ago
Do you know how it was corrupted? Is that a dvc bug, or some hardware issues?
We make the files under `.dvc/cache` read-only, so that they cannot get corrupted afterwards.
Pretty sure the files are corrupted during upload to the remote (during `dvc push`). I execute the following steps:
(1) deleting the remote (manually),
(2) pushing from a clean (no corrupted files) local cache (`dvc push`),
(3) deleting the local cache (manually), and then
(4) running `dvc pull`, with or without `verify = true` in `.dvc/config`,
which results in the behavior(s) described in my initial post.
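The steps above can be sketched as shell commands. The host name, storage path, and exact deletion method are placeholders, not the actual setup:

```shell
# (1) wipe the remote storage directory by hand (host/path are placeholders)
ssh user@storagehost 'rm -rf /data/dvc-storage/*'
# (2) push everything from a verified-clean local cache
dvc push
# (3) drop the local cache
rm -rf .dvc/cache
# (4) pull back; with verification enabled, the corrupt objects surface here
dvc pull
```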
The number and identity of the corrupt files stay the same, no matter whether seconds or days pass between steps (2) and (4). If I repeat everything (steps 1-4), I get roughly the same number of corrupted files, but different ones. All indications are that the corruption happens during `dvc push`.
During step (2), I get multiple "key exchange failed" and "Connection not open" errors. I suspect that has to do with the configuration of the SSH remote we are connecting to (over whose config we have no control) and the defaults dvc uses to interact with SSH remotes.
Running `dvc push` a second time after step (2) completes without errors, but results in the behavior (corrupt files on remote) described in my initial post. Running `dvc push` a third time does nothing ("Everything is up to date").
We've experienced corrupt files with SSH remotes too, although not as often. We also think the error happens during the push, maybe due to interrupted transfers.
We made a simple bash script to validate the remote, comparing each object's checksum against its file name to see if they match. It's a bit slow, but much better than letting corrupt data slip through the cracks.
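A minimal sketch of such a validator, run on the SSH host against the remote storage directory. It assumes the classic content-addressed layout, where each object lives at `<first 2 md5 chars>/<remaining 30 md5 chars>`, so the relative path spells out the expected checksum; `verify_dvc_cache` is our own name, not a dvc command, and `.dir` manifests (whose file names carry a suffix) are not handled here:

```shell
# Hypothetical validator sketch: re-hash every object and compare against the
# checksum encoded in its path. Run on the storage host, not through dvc.
verify_dvc_cache() {
  cache_dir="$1"
  find "$cache_dir" -type f | while read -r f; do
    # expected md5 = <parent dir name> + <file name>
    expected="$(basename "$(dirname "$f")")$(basename "$f")"
    actual="$(md5sum "$f" | awk '{print $1}')"
    [ "$expected" = "$actual" ] || echo "CORRUPT: $f (md5 is $actual)"
  done
}
```

Anything it prints is an object whose content no longer matches its name, i.e. exactly the files that a verified `dvc pull` would reject.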
A "verify" option would be useful here, but the challenge is where to run it. In our case we ran our script directly on the SSH host, but this might not always be possible.
I imagine that interrupted transfers are less common for cloud and NFS remotes. But with SSH remotes they are real, and an official solution would be welcome.
My suggestion is `dvc cache verify`, which would basically make sure that the checksums match. Or maybe `dvc data status` includes this already? The idea would then be to create an instance of the repo on the SSH host, and then (locally) set the cache for that instance to be the remote. Then `dvc cache verify` would in fact be verifying the remote.
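With existing commands, the "repo instance on the SSH host" idea could look something like the following sketch, assuming the remote stores its objects under `/data/dvc-storage` (the URL and paths are placeholders, and the final verification command is still hypothetical):

```shell
# Run on the SSH host. Repo URL and storage path are placeholders.
git clone <repo-url> verify-clone
cd verify-clone
# Point this clone's cache at the directory the remote stores objects in:
dvc cache dir /data/dvc-storage
# A hypothetical `dvc cache verify` would now be re-hashing the remote's objects.
```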
For @SebastianRiechert2, my suggestion would be to have a long hard look at `~/.ssh/config` and sort out the key issues.
@skshetry can you confirm whether we do an atomic rename when pushing objects to SSH (first create a temporary file, then do `mv`)?
> maybe due to interrupted transfers.
@johnyaku do you mean Ctrl+C or some failure? do you see them often?
Looks like we don't do atomic transfers anymore (this was lost during the batching implementation, as we now call fsspec's filesystem directly rather than our own implementation, which did atomic renames).
S3 and other cloud storages don't have this problem, so the correct fix here would be to extend fsspec's implementation in dvc-ssh to do atomic renames.
Note that all these transfers happen in `dvc_objects.fs.generic._put`.
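The fix described above boils down to the classic write-to-temp-then-rename pattern. A minimal local sketch of the idea, with `upload_atomic` as an illustrative name rather than anything from dvc-ssh; over SFTP the equivalent is uploading to a temporary name and issuing a server-side rename afterwards:

```shell
# Temp-file-then-rename upload sketch. On the same filesystem, mv/rename is
# atomic, so a reader never observes a half-written object under its final
# name: an interrupted transfer leaves only a stray *.tmp.* file behind.
upload_atomic() {
  src="$1"; dest="$2"
  tmp="$dest.tmp.$$"                       # unique temp name next to the destination
  cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
  mv "$tmp" "$dest"                        # atomic rename into place
}
```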
> do you mean Ctrl+C or some failure?
Didn't notice until much later, so it's hard to know for sure. But we mainly work on HPCs with walltime limits, so it is possible that a large transfer was interrupted when we ran out of walltime. I imagine the net effect would be similar to Ctrl+C.
> do you see them often?
No, only once or twice. More recently, I'm sure I've seen atomic renames ... or at least some kind of temporary file. Although maybe this is just when pulling? I can see how this would go a long way toward protecting against incomplete files ending up in the cache/remote.
Would still be nice to have a validation option 😃
I am facing some problems working with an SSH remote. We are only using dvc for dataset versioning. We have little control over the remote's configuration, and often when I push from the local cache to the remote, files arrive on the remote corrupted. Multiple executions of `dvc push` do not fix the issue. `dvc status -c` does not notice the corruption on the remote, even though the remote contains corrupted files and the local cache intact ones. The problem is not noticed when working alone, since dvc does not detect the corruption on the remote, thinks remote and local cache are in sync, and `dvc pull` takes the files from the local cache. The problem arises when a third party tries to pull the dataset: `dvc status -c` then reveals it (all corrupt files from the remote appear as deleted), and `dvc checkout` (and `dvc pull`) subsequently fail, I think because of the mismatch between remote and local cache.

We are using `dvc repro` instead of manually adding files, so fixing the problem by manually untracking and re-adding the files is off the table, as far as I understand.

Shouldn't there be an option to check for corruption when pushing files to the remote? Something like a `--verify` option for `dvc push`. Right now, verification only happens during pulling, at which point the corrupted files can no longer be automatically fixed by re-pulling/pushing. Also, the error we get when trying to pull from the corrupted remote (without an intact local cache) with verification turned on is not very helpful: it just asks if the cache is up to date. The cache is up to date; the remote is the thing causing problems. We are currently testing a workaround, running rsync after each push. Since there are no dvc hooks (like there are git hooks), I do not see an elegant way of automating this.
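The rsync workaround can look like the following sketch; the host and storage path are placeholders. Since cache and remote share the same content-addressed layout, mirroring the cache over the remote repairs any object the push left corrupted:

```shell
dvc push
# Mirror the verified local cache over the remote storage directory.
# --checksum compares file contents rather than size/mtime, so objects that
# arrived corrupted on the remote get re-sent from the intact local cache.
rsync -rtv --checksum .dvc/cache/ user@storagehost:/data/dvc-storage/
```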