Open Marigold opened 1 year ago
@Marigold do you save each file as a separate DVC object? To rephrase, does it mean that you have ~10K .dvc files?
@shcheklein thanks for the prompt response! Yes, we store each file as a separate DVC object, though we have <1K .dvc files. Every file has its own (custom) metadata, so we thought this was the cleanest way. I was looking for ways to "ignore" other files when doing dvc add/pull [target] (e.g. by monkey-patching DVCIgnore), but didn't find an easy solution.
This is a side effect of how we build an index when used_objs() is called. It should not read everything on repo.pull("my_file.csv.dvc"); we already optimize for this in other cases.
Thanks for the clarification @skshetry. We've hacked around it by dynamically changing .dvcignore (we also tried subrepos, but ran into problems), so we're good for now, though it would be great if this worked fast out of the box.
@Marigold, regarding dvc add/import: dvc needs to build a graph to check that there are no overlaps/duplications/cycles, which means dvc has to read all .dvc files. There is a way to skip this, by setting repo._skip_graph_checks = True, but that is broken for the same reason as above.
I'll create a PR to fix that problem; it should be fixed in a future release.
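As a rough illustration of why such a check needs every tracked output, here is a minimal sketch of an overlap/duplicate test on output paths. It is an assumption-laden toy, not DVC's graph code: overlaps and check_new_output are hypothetical helpers, paths are treated as POSIX-style strings, and cycle detection is not covered.

```python
from pathlib import PurePosixPath


def overlaps(a, b):
    """True if the two output paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def check_new_output(existing_outs, new_out):
    """Raise ValueError if new_out collides with any already tracked output.

    existing_outs must hold the outputs of *all* stage files, which is why
    a check like this forces every .dvc file to be read first.
    """
    for out in existing_outs:
        if overlaps(out, new_out):
            raise ValueError(f"{new_out!r} overlaps with tracked output {out!r}")


if __name__ == "__main__":
    tracked = ["data/a.csv", "data/raw"]
    check_new_output(tracked, "data/b.csv")  # fine, no collision
    try:
        check_new_output(tracked, "data/raw/c.csv")
    except ValueError as exc:
        print(exc)
```

Skipping the check avoids collecting existing_outs at all, which is the trade-off behind a flag such as the one mentioned above.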
Regarding push/pull, I'll try to look into it.
Much appreciated @skshetry! My hack with .dvcignore turned out to be a bad idea, so we're stuck there (it's not a blocker for us, just annoying performance).
@skshetry did you have a chance to look into this, please? As we scale our data it's becoming a bottleneck. If you don't have time for this, could you at least give me some hints on where to fix it (or suggest a workaround)?
Bug Report
Description
We have about a thousand small files in DVC. We're using the Python API, though the CLI has the same issue. We often need to add / pull a single new file, so we use something like
This takes almost 10 seconds, because DVC internally loads all stages before pulling that single file. I'd expect this to be almost instant. Why does it have to go through all the other .dvc files? (My .dvcignore ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.) Thanks!
Environment information
Output of dvc doctor:
Additional Information (if any):