Hi, it seems that you're using a fairly old dvc version. Is upgrading to a more recent version an option?
We also tested this behavior with dvc==2.27.2 and made the same observations there. The link in our report above already points to the latest version of the source code.
The graph check you noted is required because DVC needs to make sure your repo does not contain any overlapping outputs (i.e. multiple .dvc files that point to the same output path). If you are generating thousands of .dvc files, this check will eventually start to degrade performance, since every new import has to be validated against all of the .dvc files created so far.
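As a rough illustration (the file names here are made up, and the remote://rddl/ scheme is borrowed from the suggestion below), the kind of conflict the graph check guards against is a second import that writes to an output path already claimed by an existing .dvc file:

```sh
# The first import creates a .dvc file that claims data/sample.wav as its output
dvc import-url remote://rddl/yes/0001.wav data/sample.wav

# A second import targeting the same output path would overlap with the first
# .dvc file, which is exactly the situation the graph check has to catch and reject
dvc import-url remote://rddl/no/0001.wav data/sample.wav
```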
Do you need to import each file individually, or are you actually importing everything that is in the defined remote storage? If you are importing the entire remote contents, it would be faster on the DVC performance side to do something like dvc import-url remote://rddl/ -o data/ (which would generate a single data.dvc for the entire directory).
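A minimal sketch of what that could look like, assuming a DVC remote named rddl is already configured (the output path is given here as the positional out argument, and the paths are placeholders):

```sh
# One import for the entire remote directory: a single data.dvc tracks everything
dvc import-url remote://rddl/ data

# Later, pull in upstream changes to the imported data in a single step
dvc update data.dvc
```

Compared to one dvc import-url call per file, this keeps the number of .dvc files, and therefore the size of the graph that has to be checked on every import, constant.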
Closing as stale
Bug Report
Description
In our current scenario we want to use DVC together with a remote storage that is not a typical S3 bucket, so we want to use dvc import-url to track and import data from that remote storage. We have to track 4000 files per class, and the files themselves are of type *.wav (the data is taken from TensorFlow's micro_speech example -> Link).
We observed that the performance of dvc import-url dropped as we imported more and more data points from the remote.
The first execution of dvc import-url took 1.5 seconds, pulling file 2000 already took more than 20 seconds, and pulling file 3750 took almost 50 seconds.
We took a look at the source code of this command and found that in line 73 an index is created and the graph is checked: https://github.com/iterative/dvc/blob/c73764156eeea688424bbee669b4279a8bb89b96/dvc/repo/imp_url.py#L72-L74
Could this be the reason for the performance drop as the number of imported files grows? If so, are there other ways to implement this? And is there a possibility to skip creating that index and checking the graph?
For our use case we don't want to use dvc push; we only care about making dvc import-url work fast in our current setup.
Reproduce
As input we have a list of artifact IDs linking to files that are stored in the remote storage. We then loop over this list and run dvc import-url for each element, roughly as in the sketch below.
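A minimal sketch of that loop, assuming the artifact IDs are stored one per line in a file called artifact_ids.txt and the remote is reachable via the remote://rddl/ scheme mentioned above (both names are placeholders rather than the exact values from our setup):

```sh
# One dvc import-url call per artifact ID; each call creates its own .dvc file
while read -r artifact_id; do
    dvc import-url "remote://rddl/${artifact_id}.wav" "data/${artifact_id}.wav"
done < artifact_ids.txt
```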
Expected
Performance does not drop after importing a larger number of data files.
Environment information
Output of dvc doctor: