Open dluks opened 1 month ago
I found a workaround by running dvc import --no-download to first import the .dvc file only, and then using dvc pull several times to download the remaining data after the above-mentioned failures.
Note: this does not solve the underlying bug, but it is a workaround.
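For anyone who wants to script this, here is a rough sketch of the same workaround. It assumes GNU coreutils `timeout` is available; the function name `retry_with_timeout`, the attempt count and the time limit are my own placeholders, not anything DVC provides. Whether a hung dvc process exits cleanly when `timeout` signals it is also an assumption I have not verified.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): run a command under a time limit
# and retry it when it fails or is killed by the timeout.
# Usage: retry_with_timeout <max_attempts> <timeout_seconds> <command> [args...]
retry_with_timeout() {
    local max_attempts="$1"
    local limit="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # GNU `timeout` terminates the command once the limit is exceeded
        # and then reports a non-zero exit status.
        if timeout "$limit" "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed or timed out: $*" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Intended use (commented out so the snippet can be sourced without DVC):
#   dvc import --no-download https://github.com/<data-registry-repo> path/to/large-dir
#   retry_with_timeout 5 1800 dvc pull
```

The limit applies to each whole dvc pull run rather than to individual files, so it has to be long enough for a normal run to make progress. Since dvc pull only fetches what is still missing, each retry should pick up where the previous one stopped.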
Thanks for the report @dluks and that workaround sounds like a good way around it for now. I'll keep this open but not sure it's actionable at the moment without a more consistent way to reproduce.
Bug Report
import/pull: hangs when pulling many files from GCS and one (or a few) fails
Description
I have a directory in a data registry that is tracked with DVC and contains ~1.6k files. I pushed it to a Google Cloud Storage remote, and now I am trying to either dvc import or dvc pull the data from the GCS remote to a new machine. This works well for ~99% of the files, but sometimes a few seem to simply fail silently and never actually download. When this happens, the entire process hangs once all successful downloads have completed, leaving only the hung/failed/paused (not sure) downloads remaining. They never seem to resume and I am forced to Ctrl+C out of the process. This is less of an issue when using dvc pull, as I can simply re-run the command and it will only download the missing files, but with dvc import I am forced to re-run the process.
Reproduce
dvc init (data registry project)
dvc add <large directory containing many moderately-sized files (each around 30MB)>
dvc remote add <remote-name> gs://<google cloud storage bucket>
dvc push -r <remote-name>
git commit -am "commit msg" && git push
git init <new-project>
dvc import https://github.com/<data-registry-repo> path/to/large-dir
Most files finish downloading, but some fail (or hang) silently and prevent the entire dvc import from completing:
I eventually have to Ctrl+C to exit the process or else it hangs for hours. Here is the full output before and after the Ctrl+C interrupt:
Expected
I expect dvc import to detect download timeouts and either restart the hanging download or complete the process with a warning. A subsequent partial dvc import should then fetch only the missing files instead of requiring a complete re-download of the data.
Environment information
Output of dvc doctor:
Additional Information (if any):