Open gregstarr opened 9 months ago
possibly related: https://github.com/iterative/dvc/issues/9813
any update on this?
Is there a reason that this is desired behavior? Or is this a bug?
@gregstarr I think it's a lack of a particular optimization as you described.
would it be reasonable to use the overall root dir in this situation instead of the one within tmp/exps/... ?
could you clarify your suggestion please?
I'm describing a possible optimization. Basically, should DVC check if it's running an experiment in .dvc/tmp/exps/... and compute the site cache hash using the root directory, i.e. the parent of .dvc/, instead of the temp one?
Or is there some problem this will create?
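To make the suggestion concrete, here is a minimal sketch of the path check being described. It is not DVC code: `resolve_base_root` is a hypothetical helper, and the temp directory name `tmpabc123` is made up for illustration. It assumes only that experiment copies live somewhere under `.dvc/tmp/exps/` inside the base repo.

```python
from pathlib import PurePosixPath

# Hypothetical helper, not part of DVC: maps an experiment's temporary
# repo path back to the base repo root (the parent of .dvc/).
EXPS_MARKER = (".dvc", "tmp", "exps")


def resolve_base_root(repo_root: str) -> str:
    """Return the base repo root if repo_root lies under .dvc/tmp/exps,
    otherwise return repo_root unchanged."""
    parts = PurePosixPath(repo_root).parts
    width = len(EXPS_MARKER)
    for i in range(len(parts) - width + 1):
        if parts[i:i + width] == EXPS_MARKER:
            # Everything before ".dvc" is the base repo root.
            return str(PurePosixPath(*parts[:i]))
    return repo_root


# Illustrative paths only; the directory names are assumptions.
print(resolve_base_root("/home/user/project/.dvc/tmp/exps/tmpabc123"))
print(resolve_base_root("/home/user/project"))
```

With something like this, the site cache hash could be computed from the resolved root rather than from the temp directory, so the base repo and its experiment copies would map to the same cache.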
Okay, I see. Yes, I think it should be possible to do this in some way, and it makes sense. When we instantiate an experiment, we can either populate the site cache dir or copy the whole cache (probably even better, since we can prevent the database from exploding), or we can indeed use the same one but with a modified parent prefix.
@skshetry what is your take on this?
Bug Report
Description
Not sure if this is a bug per se, probably more of a discussion. I noticed that it was very slow to run experiments in parallel because it took a long time for them to start. This is because DVC is recomputing all the hashes for my large dataset.
DVC typically avoids recomputing hashes by utilizing a cache stored in site_cache_dir. The site cache dir on Linux should be something like /var/tmp/dvc/repo/{hash}. This hash is computed here and is formed from several components, including the root_dir (i.e. the DVC repo dir) and the btime, which is sort of supposed to be the creation time of the root directory but is instead taken from the mtime of the btime file in the .dvc/tmp folder.
When you run experiments in parallel, copies of the repo are made in the temp directory and the experiments are run from the copies. This means that the specific site cache dir for the repo copies will be different, because the repo paths are different and the mtimes of the copied btime files are different. This results in DVC thinking that there is no cache yet, so it recomputes all the necessary hashes for each experiment. I have evidence of this because I only have one DVC repo, but my site cache dir has many cache folders.
Unless I'm missing something, it seems like experiments should use the same site cache as the base repo.
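The effect described above can be illustrated with a small sketch. This is not DVC's actual implementation: `site_cache_key` is a hypothetical function, SHA-256 is an arbitrary choice, and the real key is formed from more components than the two shown. The paths and timestamps are made up. It only shows why changing either the root directory or the btime value leads to a different cache directory.

```python
import hashlib


def site_cache_key(root_dir: str, btime: float) -> str:
    """Hypothetical stand-in for the site cache hash: any change to
    root_dir or btime produces a different key."""
    payload = repr((root_dir, btime)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative values only: a base repo and one experiment copy.
base_key = site_cache_key("/home/user/project", 1700000000.0)
exp_key = site_cache_key(
    "/home/user/project/.dvc/tmp/exps/tmpabc123", 1700000500.0
)

# The keys differ, so the experiment copy would look up an empty cache dir.
print(base_key == exp_key)
```

Under these assumptions, every experiment copy gets its own key, which matches the observation of many cache folders appearing for a single repo.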
Reproduce
Environment information