iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

import/update: cache git repos/clones #3496

Open casperdcl opened 4 years ago

casperdcl commented 4 years ago

dvc import https://some/git/repo/ some_file
dvc update  # should not re-clone, should only pull into existing cache

Suor commented 4 years ago

The thing is, the clone cache is not persisted between dvc runs. If we make it persist, then dvc update won't re-clone and will only do a git pull.

casperdcl commented 4 years ago

yes; this is about making it persistent & pulling rather than re-cloning.

jorgeorpinel commented 4 years ago

What about a repo cache at the user level? Could be a system config var so you can disable it, like analytics.

Context: #4203
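
To make the idea above concrete, here is a minimal sketch of a user-level clone cache that clones once and fetches afterwards. It is not dvc's implementation: the `CLONE_CACHE` variable, its default location, and the `clone_dir_for` and `sync_clone` helpers are all hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dvc: keep one persistent clone per URL
# under a user-level cache directory and fetch instead of re-cloning.

# Assumed cache location; override by setting CLONE_CACHE beforehand.
CLONE_CACHE="${CLONE_CACHE:-$HOME/.cache/repo-clones}"

# Map a repo URL to a stable directory inside the cache.
# cksum is POSIX; on stdin it prints "<crc> <byte count>", we keep the crc.
clone_dir_for() {
    key=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    printf '%s/%s\n' "$CLONE_CACHE" "$key"
}

# Clone on first use; on later calls only fetch into the existing clone.
sync_clone() {
    url=$1
    dir=$(clone_dir_for "$url")
    if [ -d "$dir" ]; then
        git -C "$dir" fetch --prune origin
    else
        mkdir -p "$CLONE_CACHE"
        git clone "$url" "$dir"
    fi
}

# Example (needs connectivity on the first call):
#   sync_clone https://some/git/repo/
```

A CRC of the URL keeps the directory names short but can collide in principle, so a real implementation would probably want a stronger hash. A config switch to disable the cache, as suggested above, would just bypass `sync_clone` and clone into a temporary directory.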

casperdcl commented 3 years ago

In light of #4246 being merged, I'm going to downgrade the priority here...

johnyaku commented 2 months ago

Persistent clones (as per #10511) are different from shallow clones (as per #4246). Both speed up cloning (or potentially avoid it), but only persistent clones allow us to work with imported data without internet connectivity, which is necessary for us on an HPC where most queues have no connectivity.

Persistent clones would also allow us to separate cloning (which requires connectivity) from other dvc operations (which don't). This would allow us to do the former in an environment (queue) with connectivity and the latter in environments without.

dberenbaum commented 2 months ago

@johnyaku Have you considered keeping a clone on a shared space of the HPC so you can import from there instead of from the internet? Even if dvc had some support for caching clones, it would likely still need to check the internet to fetch updates from those clones. If you have your own clone of the repo, you can fully control when to update it and everyone can share that single repo copy (dvc will not make a new clone of a local repo).

johnyaku commented 2 months ago

@dberenbaum I've been thinking along the same lines. We could maintain (and periodically update) repos on the local filesystem and specify the path to those repos instead of GitHub URLs. This would solve the no-internet access problem.

But we also want to maintain portability between platforms. (We work on two different HPCs, plus GCP.) So a URL that is accessible from any platform would be better from a portability perspective.

I can have calls to GitHub loop back to localhost in ~/.ssh/config, but then I'd need to change those settings back in order to update the local mirrors, which is potentially tedious (and error-prone if there are other processes accessing ~/.ssh/config at the same time). If I could maintain two separate configs then I might be able to get it to work, but AFAIK the location of the SSH config is not configurable.

Happy to explore workarounds like this, or maybe dvc could keep the clones that it is already making?
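
On the two-configs question: `ssh` accepts an alternate config file via `-F`, and git runs whatever `GIT_SSH_COMMAND` names, so mirror updates can use a second config without touching `~/.ssh/config`. A minimal sketch follows, assuming a hypothetical `~/.ssh/config.upstream` that points at the real GitHub and a hypothetical mirror path. Whether dvc itself respects `GIT_SSH_COMMAND` is not something this relies on, since only plain `git` is invoked.

```shell
#!/bin/sh
# Sketch only: file names and paths are assumptions.
#   ~/.ssh/config           default config, may redirect github.com to a local mirror
#   ~/.ssh/config.upstream  hypothetical second config that reaches the real GitHub

UPSTREAM_SSH_CONFIG="${UPSTREAM_SSH_CONFIG:-$HOME/.ssh/config.upstream}"

# Fetch into one local mirror, using the alternate SSH config for this
# git invocation only. With -F, ssh reads that file instead of the
# default per-user and system-wide configs.
update_mirror() {
    mirror=$1
    GIT_SSH_COMMAND="ssh -F $UPSTREAM_SSH_CONFIG" \
        git -C "$mirror" fetch --prune origin
}

# Example (run from a queue with connectivity):
#   update_mirror /shared/mirrors/some-repo.git
```

Because the variable is set only for the one `git` command, other processes reading `~/.ssh/config` at the same time are unaffected. The config path is interpolated unquoted into the ssh command line, so it should not contain spaces.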