iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

import-url: Performance decrease with growing number of data files #8373

Closed patrickbrus closed 2 months ago

patrickbrus commented 2 years ago

Bug Report

Description

In our current scenario we want to use dvc together with a remote storage that is not a typical S3 bucket, so we want to use dvc import-url to track and import data from that remote storage. We have to track 4000 files per class, and the files themselves are of type *.wav (the data is taken from TensorFlow's micro speech example -> Link).

We observed that the performance of dvc import-url dropped as we imported more data points from the remote.

The first execution of dvc import-url took 1.5 seconds:


Pulling file 2000 already took more than 20 seconds:


And pulling file 3750 then took almost 50 seconds:


We looked into the source code of this command and found that an index is created and the graph is checked: https://github.com/iterative/dvc/blob/c73764156eeea688424bbee669b4279a8bb89b96/dvc/repo/imp_url.py#L72-L74

Could this be the reason for the performance drop as we import more files? If so, is there another way to implement this, or even a way to skip creating the index and checking the graph?
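To see why a per-import graph check would explain this pattern, here is a minimal cost-model sketch (purely illustrative, not DVC's actual implementation): if the k-th import has to scan all k existing .dvc files, each individual import gets linearly slower and the cumulative work over n imports is quadratic.

```python
# Hypothetical cost model, NOT DVC's real code: assume each import-url
# call rebuilds an index over all existing .dvc files, so the k-th
# import does work proportional to k.
def import_cost(k, per_file_cost=1):
    """Cost of the k-th import if the graph check scans k existing .dvc files."""
    return k * per_file_cost

def total_cost(n):
    """Cumulative cost of importing n files one at a time."""
    return sum(import_cost(k) for k in range(1, n + 1))

# Doubling the number of imported files roughly quadruples the total
# time, which is consistent with the slowdown observed above.
print(total_cost(2000), total_cost(4000))
```

Under this model the 3750th import is almost twice as expensive as the 2000th, which matches the rough shape of the measured timings.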

We don't need dvc push; we only care about making import-url fast in our current setup.

Reproduce

As input we have a list of artifact IDs linking to files that are stored in the remote storage. We then loop over this list and for each element we run:

# added time command to measure execution time of this command
time dvc import-url remote://storage/artifactID destfolder/targetfilename
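The loop above can be sketched in Python as follows. This is only a sketch of our reproduction setup: artifact_ids, the destination naming, and the .wav suffix are placeholders for our actual data.

```python
# Sketch of the reproduce loop: one `dvc import-url` call per artifact ID,
# timing each call. `artifact_ids` and the destination layout are
# placeholders, not DVC internals.
import subprocess
import time

def build_import_cmd(artifact_id, dest_dir="destfolder"):
    """Build the `dvc import-url` argv for one artifact (not executed here)."""
    return [
        "dvc", "import-url",
        f"remote://storage/{artifact_id}",
        f"{dest_dir}/{artifact_id}.wav",
    ]

def import_all(artifact_ids):
    """Run one import per artifact and report its wall-clock duration."""
    for artifact_id in artifact_ids:
        start = time.monotonic()
        subprocess.run(build_import_cmd(artifact_id), check=True)
        print(f"{artifact_id}: {time.monotonic() - start:.1f}s")
```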

Expected

Performance should not degrade as the number of imported data files grows.

Environment information

Output of dvc doctor:

DVC version: 2.8.1 (pip)
---------------------------------
Platform: Python 3.9.7 on Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.28
Supports:
        webhdfs (fsspec = 2021.10.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2021.10.0, boto3 = 1.17.106)

dtrifiro commented 2 years ago

Hi, it seems that you're using a fairly old dvc version. Is upgrading to a more recent version an option?

patrickbrus commented 2 years ago

We also tested this with dvc==2.27.2 and observed the same behavior. We also linked the latest version of the source code in our report above.

pmrowla commented 2 years ago

The graph check you noted is required since DVC needs to make sure your repo does not contain any overlapping outputs (i.e. multiple .dvc files that point to the same output path). If you are generating thousands of .dvc files it will eventually start to degrade performance.

Do you need to import each file individually, or are you actually importing everything in the defined remote storage? If you are importing the entire remote contents, it would be much faster on the DVC side to do something like dvc import-url remote://rddl/ -o data/, which generates a single data.dvc for the entire directory.
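For illustration only (this is not DVC's actual code), the overlap condition that the graph check guards against can be sketched like this: two outputs overlap when they are the same path or one is nested inside the other, and a naive pairwise scan over all tracked outputs does work that grows with every additional .dvc file.

```python
# Illustrative sketch of an overlapping-outputs check, NOT DVC's
# implementation. Two outputs overlap if they are the same path or one
# path is nested inside the other.
from pathlib import PurePosixPath

def outputs_overlap(path_a, path_b):
    a, b = PurePosixPath(path_a), PurePosixPath(path_b)
    return a == b or a in b.parents or b in a.parents

def find_overlaps(output_paths):
    """Naive O(n^2) pairwise scan: cost grows with every new .dvc file."""
    overlaps = []
    for i, a in enumerate(output_paths):
        for b in output_paths[i + 1:]:
            if outputs_overlap(a, b):
                overlaps.append((a, b))
    return overlaps
```

A single directory-level import sidesteps this growth: one data.dvc contributes one output to the graph instead of thousands.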

dberenbaum commented 2 months ago

Closing as stale