iterative/dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

pull: pulling a single file is really slow when there are hundreds of other .dvc files #8768

Open · Marigold opened 1 year ago

Marigold commented 1 year ago

Bug Report

Description

We have about a thousand small files in DVC. We're using the Python API, though the CLI has the same issue. We often need to add or pull a single new file, so we use something like:

from dvc.repo import Repo

repo = Repo("repo_root")        # open the DVC repo at its root
repo.pull("my_file.csv.dvc")    # pull only this one target

This takes almost 10 seconds, because DVC internally loads all stages before pulling that single file. I'd expect this to be almost instant. Why does it have to go through all the other .dvc files? (Our .dvcignore ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.)
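For reference, one quick way to confirm where the time goes is to profile the targeted pull. This is a minimal sketch using Python's built-in cProfile; the "repo_root" path and the target name are placeholders:

import cProfile
import pstats

from dvc.repo import Repo

repo = Repo("repo_root")  # placeholder path

# Profile the targeted pull to see how much time is spent loading
# stages/.dvc files versus actually transferring the one file.
with cProfile.Profile() as prof:
    repo.pull("my_file.csv.dvc")

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)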

Thanks!

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.14 on macOS-12.5-x86_64-i386-64bit
Subprojects:
    dvc_data = 0.28.4
    dvc_objects = 0.14.0
    dvc_render = 0.0.15
    dvc_task = 0.1.8
    dvclive = 1.2.2
    scmrepo = 0.1.4
Supports:
    http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3, https, s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information (if any):

shcheklein commented 1 year ago

@Marigold do you save each file as a separate DVC object? To rephrase: does that mean you have ~10K .dvc files?

Marigold commented 1 year ago

@shcheklein thanks for the prompt response! Yes, we store each file as a separate DVC object, though we have <1K .dvc files. Every file has its own (custom) metadata, so we thought this was the cleanest way. I was looking for ways to "ignore" the other files when doing dvc add/pull [target] (e.g. by monkey-patching DVCIgnore), but didn't find an easy solution.

skshetry commented 1 year ago

This is a side effect of how we build an index when used_objs() is called. It should not read everything on repo.pull("my_file.csv.dvc"); we already optimize for this in other cases.

Marigold commented 1 year ago

Thanks for the clarification @skshetry. We've hacked around it by dynamically changing .dvcignore (we also tried subrepos, but I ran into problems), so we're good for now, though it would be great if this worked fast out of the box.
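For illustration only (since, as noted further down, the hack turned out to be a bad idea), the .dvcignore trick looked roughly like this: temporarily ignore every .dvc file except the target, pull, then restore the original file. This is a hedged sketch, not the exact code, and it assumes .dvcignore honors gitignore-style negation patterns:

import contextlib
import os

from dvc.repo import Repo

@contextlib.contextmanager
def ignore_all_except(repo_root, target):
    """Temporarily make .dvcignore hide every .dvc file except `target`."""
    path = os.path.join(repo_root, ".dvcignore")
    if os.path.exists(path):
        with open(path) as f:
            original = f.read()
    else:
        original = None
    try:
        with open(path, "w") as f:
            # Ignore all .dvc files, then re-include the one we want.
            f.write("*.dvc\n!%s\n" % target)
        yield
    finally:
        if original is None:
            os.remove(path)
        else:
            with open(path, "w") as f:
                f.write(original)

with ignore_all_except("repo_root", "my_file.csv.dvc"):
    Repo("repo_root").pull("my_file.csv.dvc")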

skshetry commented 1 year ago

@Marigold, regarding dvc add/import: DVC needs to build a graph to ensure there are no overlaps/duplications/cycles, which means it has to read all the .dvc files. There is a way to skip this by setting repo._skip_graph_checks = True, but that is broken for the same reason as above.
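For illustration, the escape hatch mentioned above would be used roughly like this (a sketch only: _skip_graph_checks is a private attribute, not a supported API, and as noted it is currently broken):

from dvc.repo import Repo

repo = Repo("repo_root")  # placeholder path

# Private escape hatch: skips the overlap/duplication/cycle graph
# checks that otherwise force DVC to read every .dvc file.
repo._skip_graph_checks = True
repo.add("my_file.csv")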

I'll create a PR to fix that problem; it should be fixed in future releases. Regarding push/pull, I'll try to look into it.

Marigold commented 1 year ago

Much appreciated @skshetry! My hack with .dvcignore turned out to be a bad idea, so we're stuck there (it's not a blocker for us, just an annoying performance hit).

Marigold commented 1 year ago

@skshetry did you have a chance to look into this, please? As we scale our data, it's becoming a bottleneck. If you don't have time for this, could you at least give me some hints on where to fix it (or suggest a workaround)?