iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

exp run: unnecessary hashing during experiments #10308

Open gregstarr opened 9 months ago

gregstarr commented 9 months ago

Bug Report

Description

Not sure if this is a bug per se; it's probably more of a discussion. I noticed that running experiments in parallel was very slow because they took a long time to start. This is because DVC is recomputing all the hashes for my large dataset.

DVC typically avoids recomputing hashes by using a cache stored in site_cache_dir. On Linux, the site cache dir should be something like /var/tmp/dvc/repo/{hash}. This hash is formed from several components, including the root_dir (i.e. the DVC repo dir) and the btime, which is supposed to be the creation time of the root directory but is instead taken from the mtime of the btime file in the .dvc/tmp folder.

When you run experiments in parallel, copies of the repo are made in the temp directory and the experiments are run from the copies. This means that the specific site cache dir for the repo copies will be different because the repo paths are different and the mtimes of the copied btime files are different. This results in DVC thinking that there is no cache yet and so it recomputes all the necessary hashes for each experiment. I have evidence of this because I only have one dvc repo, but my site cache dir has many cache folders.
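The effect described above can be reproduced with a minimal sketch. This is not DVC's actual code: `site_cache_key` and `make_fake_repo` are hypothetical helpers, and the choice of MD5 over a JSON payload is an assumption made only so the output resembles the 32-character folder names. The only facts taken from the description are that the key depends on the repo path and on the mtime of the btime file in .dvc/tmp.

```python
import hashlib
import json
import os
import tempfile


def site_cache_key(root_dir: str) -> str:
    """Hypothetical stand-in for the site cache key (not DVC's real function).

    Hashes the absolute repo path together with the mtime of .dvc/tmp/btime.
    """
    btime_file = os.path.join(root_dir, ".dvc", "tmp", "btime")
    btime = os.stat(btime_file).st_mtime_ns
    payload = json.dumps(
        {"root_dir": os.path.abspath(root_dir), "btime": btime},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode()).hexdigest()


def make_fake_repo(parent: str, name: str) -> str:
    """Create an empty directory tree that only contains .dvc/tmp/btime."""
    root = os.path.join(parent, name)
    os.makedirs(os.path.join(root, ".dvc", "tmp"))
    with open(os.path.join(root, ".dvc", "tmp", "btime"), "w"):
        pass
    return root


with tempfile.TemporaryDirectory() as tmp:
    base = make_fake_repo(tmp, "repo")
    # Stand-in for the copy that a parallel experiment runs from.
    copy = make_fake_repo(tmp, "exp_copy")
    base_key = site_cache_key(base)
    copy_key = site_cache_key(copy)
    # The paths differ, so the keys differ and the copy sees an empty cache.
    print(base_key != copy_key)
```

Because the path alone is enough to change the key, matching the btime mtimes between the repo and its copy would not make the two share a cache in this sketch.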

Unless I'm missing something, it seems like experiments should use the same site cache as the base repo.

Reproduce

  1. look in your site cache dir, take note of the hashes
  2. run a bunch of experiments in parallel
  3. see that the site cache dir has more cache folders
$ ls -al /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/
total 72
drwxrwxrwx 18 starrgw1 starrgw1 4096 Feb 17 06:09 .
drwxrwxr-x  3 starrgw1 starrgw1 4096 Feb 15 17:11 ..
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 048e839878f97ba9324bb139fa8e4b06
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 0c53c5b78086c5438b3ee6b4aaef570d
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 17 06:09 1f18cf09ad43f0845bea96b6b719b3ee
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 378f0eae8f9824f1f96149c481621d03
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 441bd548b8b298abffb2449dc7c1cf54
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 465bff9fb0df8bd1be46b6ec24fdb069
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 511df2ed3e7fdf1d12303c5929277158
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 73887b1a621845b9038bb7d3ec4ba704
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 20:17 7f4e8b33c6bc7ef879b1491b9ed50fec
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 82c6be6b9d97c42ec7ba7569d39a9a65
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 aee9b76e8f486264f0800522304b53b0
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 18:53 d412c540ff7f186df3641073fe15a061
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 e4efc309f726450d1b3bdb37748a60d5
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 e93d6446a825241907ed374d37e1f58d
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 f0fb5078327924424b4c3ae74fe98b46
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 f2d3168db34ebf88584f37903b9b3dcc
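The reproduction steps above can also be scripted rather than compared by eye. The following is a minimal sketch using only the standard library; `list_cache_dirs` and `new_cache_dirs` are hypothetical helper names, and the temporary directory stands in for the real site cache dir so the example runs anywhere.

```python
import os
import tempfile


def list_cache_dirs(site_cache_root):
    """Return the names of the per-repo cache folders under site_cache_root."""
    with os.scandir(site_cache_root) as entries:
        return {entry.name for entry in entries if entry.is_dir()}


def new_cache_dirs(before, after):
    """Return, sorted, the folders that appeared between two snapshots."""
    return sorted(after - before)


with tempfile.TemporaryDirectory() as root:
    # Folder that already exists for the base repo.
    os.mkdir(os.path.join(root, "d412c540ff7f186df3641073fe15a061"))
    before = list_cache_dirs(root)

    # Stand-in for "run a bunch of experiments in parallel".
    os.mkdir(os.path.join(root, "048e839878f97ba9324bb139fa8e4b06"))
    after = list_cache_dirs(root)

    added = new_cache_dirs(before, after)
    print(added)
```

In a real check, the snapshots would be taken on the directory reported as Repo.site_cache_dir's parent by dvc doctor, once before and once after the experiments run.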

Environment information

dvc doctor
DVC version: 3.38.1 (pip)
-------------------------
Platform: Python 3.10.13 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 3.7.0
        dvc_objects = 3.0.3
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: /home/starrgw1/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: lustre on 192.168.199.212@o2ib:192.168.199.213@o2ib:/scratch
Caches: local
Remotes: local
Workspace directory: nfs on master:/home
Repo: dvc, git
Repo.site_cache_dir: /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/d412c540ff7f186df3641073fe15a061

gregstarr commented 9 months ago

possibly related: https://github.com/iterative/dvc/issues/9813

gregstarr commented 2 months ago

any update on this?

gregstarr commented 2 weeks ago

Is there a reason that this is desired behavior? Or is this a bug?

shcheklein commented 2 weeks ago

@gregstarr I think it's a lack of a particular optimization as you described.

gregstarr commented 2 weeks ago

would it be reasonable to use the overall root dir in this situation instead of the one within tmp/exps/... ?

shcheklein commented 2 weeks ago

> would it be reasonable to use the overall root dir in this situation instead of the one within tmp/exps/... ?

could you clarify your suggestion please?

gregstarr commented 2 weeks ago

I'm describing a possible optimization. Basically, should DVC check if it's running an experiment in .dvc/tmp/exps/... and compute the site cache hash using the root directory, i.e. the parent of .dvc/, instead of the temp one?

Or is there some problem this will create?
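A rough sketch of the path check being proposed, assuming the experiment copies live under .dvc/tmp/exps/ as described in this thread. `base_root_for_cache` is a hypothetical name, the example paths are made up, and nothing here reflects how DVC actually resolves its root directory.

```python
from pathlib import PurePosixPath

# Path components that mark a temporary experiment workspace (from the thread).
EXP_MARKER = (".dvc", "tmp", "exps")


def base_root_for_cache(root_dir: str) -> str:
    """Return the parent of .dvc/ if root_dir is inside .dvc/tmp/exps/.

    Any other path is returned unchanged.
    """
    parts = PurePosixPath(root_dir).parts
    for i in range(len(parts) - len(EXP_MARKER) + 1):
        if parts[i:i + len(EXP_MARKER)] == EXP_MARKER:
            return str(PurePosixPath(*parts[:i]))
    return root_dir


# Hypothetical paths, for illustration only.
exp_root = "/home/user/project/.dvc/tmp/exps/tmpabc123"
print(base_root_for_cache(exp_root))
print(base_root_for_cache("/home/user/project"))
```

The site cache hash would then be computed from the returned path, so every experiment copy maps to the same folder as the base repo. Whether concurrent experiments can safely share one cache folder is a separate question that this sketch does not address.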

shcheklein commented 2 weeks ago

Okay, I see. Yes, I think it should be possible in some way to do this. It makes sense. Either we populate the site cache dir when we instantiate an experiment, or we copy the whole cache (probably even better, since we can prevent the database from exploding), or we can indeed use the same one but with a modified parent prefix.

@skshetry what is your take on this?
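For comparison, the "copy the whole cache" option mentioned above might look roughly like the sketch below. `seed_site_cache` is a hypothetical helper and the `hashes/state.db` file name is invented for the demo; the sketch also assumes nothing is writing to the base cache while it is copied, which may not hold for a live database.

```python
import shutil
import tempfile
from pathlib import Path


def seed_site_cache(base_cache_dir, exp_cache_dir) -> bool:
    """Copy the base repo's site cache into the experiment's cache dir.

    Does nothing if the base cache is missing or the experiment already
    has a cache dir. Returns True only when a copy was made.
    """
    base = Path(base_cache_dir)
    exp = Path(exp_cache_dir)
    if not base.is_dir() or exp.exists():
        return False
    shutil.copytree(base, exp)
    return True


with tempfile.TemporaryDirectory() as tmp:
    # Fake base cache with one invented file in it.
    base_cache = Path(tmp) / "base_key"
    (base_cache / "hashes").mkdir(parents=True)
    (base_cache / "hashes" / "state.db").write_text("cached hashes")

    exp_cache = Path(tmp) / "exp_key"
    first = seed_site_cache(base_cache, exp_cache)   # copies the cache
    second = seed_site_cache(base_cache, exp_cache)  # already seeded, no-op
    seeded_file_exists = (exp_cache / "hashes" / "state.db").exists()
    print(first, second, seeded_file_exists)
```

Copying gives each experiment an isolated cache at the cost of disk space and copy time, while the shared-prefix option avoids the copy but makes all experiments write to the same folder.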