Bug Report

Description

When downloading files from remote storage via dvc pull, the download stalls for about 10 seconds after every number-of-jobs files before continuing.
I have a folder on my remote storage containing ~1700 small files, about 1 GB in total. While pulling, the download stalls after every 64 files: after ~10 seconds it continues with the next 64 files, then stalls again. The default number of jobs for a pull command is number_of_cpu_cores * 4, which in my case is 64. When changing the number of jobs, e.g. to 10 via dvc pull -j 10, the download stalls after every 10 files.
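For reference, the default job count can be checked directly (a sketch: the 4x-cores multiplier is inferred from the batch size observed here, not taken from dvc's documentation):

```shell
# Compute the apparent default job count: 4x the number of CPU cores
# (assumption based on the observed batches of 64 on a 16-core machine).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "default jobs: $((cores * 4))"
```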
For the sake of testing, I set up two folders, one containing 10 large files and the other containing ~1700 small files. Both folders are about 1 GB in size. I measured the download times (using the time command):
large files: 0m12.766s
small files: 5m21.227s
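The two test folders can be recreated with something like the following (a scaled-down sketch: file names are placeholders, and the sizes are reduced so it runs quickly; scale bs/count up to reproduce the ~1 GB sets):

```shell
# Create 10 "large" files and 170 "small" files of zeros.
mkdir -p data_large data_small
for i in $(seq 1 10); do
  dd if=/dev/zero of=data_large/big_$i.bin bs=1024 count=100 2>/dev/null
done
for i in $(seq 1 170); do
  dd if=/dev/zero of=data_small/small_$i.bin bs=1024 count=6 2>/dev/null
done
```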
As expected, the download of the small files takes longer, but it takes far too long: more than 2 minutes are spent stalled, not downloading anything.
To rule out the network, I ran the same test directly on the machine hosting the remote storage, with the same effect. I also downloaded the files via a Python script (using pyocclient), which took 3m18.321s (>30% faster). I use nginx as a reverse proxy in front of the remote storage; the nginx.conf is linked below.
I am fairly certain the problem comes from dvc or its webdav support, and is either an issue with logins or dvc itself taking too long to resolve the versioning overhead. I am at my wits' end. I know downloading many small files will never be as fast as downloading the same amount of data in big files, but you can imagine the time overhead at scale when the download stalls for 10 seconds after every 64 files.
Reproduce
dvc config:
[core]
remote = test
['remote "test"']
url = webdav://myserver/remote.php/dav/test
user = user
password = password
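If the installed dvc version supports a per-remote jobs option (an assumption worth verifying with dvc remote modify --help), the parallelism for the experiments can also be pinned in the config instead of passing -j on every pull:

['remote "test"']
url = webdav://myserver/remote.php/dav/test
jobs = 10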
push all files to dvc and download them:
dvc add ./data/*.fileending
dvc push
# remove local copies
rm -r ./data/*.fileending
# clear cache
rm -r .dvc/cache
# pull and measure time
time dvc pull
Things I already tried:
- Downloading locally (on the same machine as the remote storage): same behavior.
- Downloading the files directly via pyocclient: >30% speed improvement (see Description).
- Changing the number of jobs to 1, which results in a much higher download time (10m10.684s).
- Changing the number of jobs to 100 and 500, which also results in higher download times; the download still stalls after every number-of-jobs files.
- Enabling Redis caching for the remote storage: no improvement.
- Analyzing the network traffic with Wireshark: the waiting time seems to occur when dvc requests the file paths of the next batch (no guarantee). Sadly, I cannot share the log.
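The stall pattern seen in the Wireshark capture can be illustrated with a network-free simulation (a sketch: the batch size and overhead are placeholders, with the observed ~10 s stall scaled down to 1 s). A fixed pause at every batch boundary, not the transfers themselves, dominates the total time once the files are small:

```shell
# Simulated pull: run `jobs` downloads in parallel, then pause before
# requesting the next batch, as observed in the capture.
files=128; jobs=64; pause=1
start=$(date +%s)
i=0
while [ "$i" -lt "$files" ]; do
  for _ in $(seq 1 "$jobs"); do
    sleep 0.01 &      # one simulated small-file transfer per job slot
  done
  wait                # batch barrier: all jobs finish before the next batch
  sleep "$pause"      # per-batch overhead (path lookup / auth round-trip)
  i=$((i + jobs))
done
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

With 128 files and 64 jobs there are only 2 batches, yet the fixed per-batch pause already accounts for nearly all of the elapsed time; with ~1700 files and a real ~10 s pause, the overhead alone is several minutes.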
nginx config: https://pastebin.com/azkVbzvx
postgresql.conf: https://pastebin.com/cBdUEhv3
Expected
The download not stalling for 10 seconds after every number-of-jobs files.
Environment information
Output of dvc doctor:

Additional Information (if any):