Bug Report

Description

When downloading files from remote storage via dvc pull, the download stalls for about 10 seconds after every number-of-jobs files before continuing.
I have a folder on my remote storage containing ~1700 small files, about 1 GB in total. While pulling, the download stalls after every 64 files: after ~10 seconds it continues with the next 64 files, then stalls again. The default number of jobs for a pull command is number_of_cpu_cores * 4, which in my case is 64. When changing the number of jobs, e.g. to 10 via dvc pull -j 10, the download stalls after every 10 files.
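For reference, the default job count can be checked directly (a sketch: the 4x-cores multiplier is inferred from the batch size observed here, not taken from dvc's documentation):

```shell
# Compute the apparent default job count: 4x the number of CPU cores
# (assumption based on the observed batches of 64 on a 16-core machine).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "default jobs: $((cores * 4))"
```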
For the sake of testing, I set up two folders, one containing 10 large files and the other containing ~1700 small files. Both folders are about 1 GB in size. I measured the download times (using the time command):
large files: 0m12.766s
small files: 5m21.227s
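The two test folders can be recreated with something like the following (a scaled-down sketch: file names are placeholders, and the sizes are reduced so it runs quickly; scale bs/count up to reproduce the ~1 GB sets):

```shell
# Create 10 "large" files and 170 "small" files of zeros.
mkdir -p data_large data_small
for i in $(seq 1 10); do
  dd if=/dev/zero of=data_large/big_$i.bin bs=1024 count=100 2>/dev/null
done
for i in $(seq 1 170); do
  dd if=/dev/zero of=data_small/small_$i.bin bs=1024 count=6 2>/dev/null
done
```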
As expected, the download of the small files takes longer, but it takes far too long: more than 2 minutes are spent stalled, not downloading anything.
To rule out the network, I ran the same test directly on the machine hosting the remote storage, with the same effect. I also downloaded the files via a Python script (using pyocclient), which took 3m18.321s (>30% faster). I use nginx as a reverse proxy in front of the remote storage; the nginx.conf is linked below.
I am fairly certain the problem comes from dvc or its webdav support, and is either an issue with logins or dvc itself taking too long to resolve the versioning overhead. I am at my wits' end. I know downloading many small files will never be as fast as downloading the same amount of data in big files, but you can imagine the time overhead at scale when the download stalls for 10 seconds after every 64 files.
Reproduce
dvc config:
[core]
remote = test
['remote "test"']
url = webdav://myserver/remote.php/dav/test
user = user
password = password
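If the installed dvc version supports a per-remote jobs option (an assumption worth verifying with dvc remote modify --help), the parallelism for the experiments can also be pinned in the config instead of passing -j on every pull:

['remote "test"']
url = webdav://myserver/remote.php/dav/test
jobs = 10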
push all files to dvc and download them:
dvc add ./data/*.fileending
dvc push
# remove local copies
rm -r ./data/*.fileending
# clear cache
rm -r .dvc/cache
# pull and measure time
time dvc pull
Things I already tried:
- Downloading locally (on the same machine as the remote storage): same behavior.
- Downloading the files directly via pyocclient: >30% speed improvement (see Description).
- Changing the number of jobs to 1, which results in a much higher download time (10m10.684s).
- Changing the number of jobs to 100 and 500, which also results in higher download times; the download still stalls after every number-of-jobs files.
- Enabling Redis caching for the remote storage: no improvement.
- Analyzing the network traffic with Wireshark: the waiting time seems to occur when dvc requests the file paths of the next batch (no guarantee). Sadly, I cannot share the log.
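The stall pattern seen in the Wireshark capture can be illustrated with a network-free simulation (a sketch: the batch size and overhead are placeholders, with the observed ~10 s stall scaled down to 1 s). A fixed pause at every batch boundary, not the transfers themselves, dominates the total time once the files are small:

```shell
# Simulated pull: run `jobs` downloads in parallel, then pause before
# requesting the next batch, as observed in the capture.
files=128; jobs=64; pause=1
start=$(date +%s)
i=0
while [ "$i" -lt "$files" ]; do
  for _ in $(seq 1 "$jobs"); do
    sleep 0.01 &      # one simulated small-file transfer per job slot
  done
  wait                # batch barrier: all jobs finish before the next batch
  sleep "$pause"      # per-batch overhead (path lookup / auth round-trip)
  i=$((i + jobs))
done
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

With 128 files and 64 jobs there are only 2 batches, yet the fixed per-batch pause already accounts for nearly all of the elapsed time; with ~1700 files and a real ~10 s pause, the overhead alone is several minutes.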
nginx config: https://pastebin.com/azkVbzvx
postgresql.conf: https://pastebin.com/cBdUEhv3
Expected
The download not stalling for 10 seconds after every number-of-jobs files.
Environment information
Output of dvc doctor:

Additional Information (if any):