Open casperdcl opened 4 years ago
The thing is that the cache is not persisted between dvc runs. If we make it persist, then dvc update won't re-clone; it will only do a git pull.
yes; this is about making it persistent & pulling rather than re-cloning.
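To make the "pull rather than re-clone" idea concrete, here is a minimal Python sketch. It is not dvc's implementation: the function names (cache_dir_for, plan_sync, sync), the hashed directory layout, and the cache root are all hypothetical, and it assumes the git CLI is on PATH.

```python
import hashlib
import subprocess
from pathlib import Path


def cache_dir_for(url, root):
    """Map a repo URL to a stable directory under the cache root (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(root) / digest


def plan_sync(url, root):
    """Return the git command that would bring the cached clone up to date."""
    target = cache_dir_for(url, root)
    if (target / ".git").is_dir():
        # A clone from an earlier run exists: update it instead of re-cloning.
        return ["git", "-C", str(target), "pull", "--ff-only"]
    # First use of this URL: a full clone is unavoidable.
    return ["git", "clone", url, str(target)]


def sync(url, root, run=subprocess.check_call):
    """Clone or update the cached repo and return its path.

    `run` is injectable so the logic can be exercised without network access.
    """
    Path(root).mkdir(parents=True, exist_ok=True)
    run(plan_sync(url, root))
    return cache_dir_for(url, root)
```

Calling sync(url, root) twice with the same root would clone on the first call and only pull on the second, provided the root directory survives between runs.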
What about a repo cache at the user level? Could be a system config var so you can disable it, like analytics.
Context: #4203
In light of #4246 being merged, I'm going to downgrade the priority here.
Persistent clones (as per #10511) are different from shallow clones (as per #4246). Both speed up cloning (or potentially avoid it), but only persistent clones would allow us to work with imported data without internet connectivity, which is necessary for us on an HPC where most queues have no connectivity.
Persistent clones would also allow us to separate cloning (which requires connectivity) from other dvc operations (which don't). This would allow us to do the former in an environment (queue) with connectivity and the latter in environments without.
@johnyaku Have you considered keeping a clone on a shared space of the HPC so you can import from there instead of from the internet? Even if dvc had some support for caching clones, it would likely still need to check the internet to fetch updates from those clones. If you have your own clone of the repo, you can fully control when to update it and everyone can share that single repo copy (dvc will not make a new clone of a local repo).
@dberenbaum I've been thinking along the same lines. We could maintain (and periodically update) repos on the local filesystem and specify the path to those repos instead of GitHub URLs. This would solve the no-internet access problem.
But we also want to maintain portability between platforms. (We work on two different HPCs, plus GCP.) So a URL that is accessible from any platform would be better from a portability perspective.
I can have calls to GitHub loop back to localhost in ~/.ssh/config, but then I'd need to change those settings back in order to update the local mirrors, which is potentially tedious (and error-prone if other processes are accessing ~/.ssh/config at the same time). If I could maintain two separate configs then I might be able to get it to work, but AFAIK the location of the SSH config is not configurable.
Happy to explore workarounds like this, or maybe dvc could keep the clones that it is already making?
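One way the local-mirror workaround from this thread might stay portable across platforms is a small lookup that prefers a mirror on the local filesystem when one exists and otherwise falls back to the remote URL. The sketch below is only an illustration: resolve_repo and the mirror layout (mirror root / host / repo path) are hypothetical, nothing here is part of dvc, and scp-style SSH URLs such as git@host:org/repo.git are not handled.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_repo(url, mirror_root):
    """Return a local mirror path if one exists for `url`, else the URL itself.

    Assumed (hypothetical) layout: <mirror_root>/<host>/<path of the repo>.
    """
    parsed = urlparse(url)
    candidate = Path(mirror_root) / parsed.netloc / parsed.path.lstrip("/")
    # Use the mirror only when it is actually present on this platform.
    return str(candidate) if candidate.is_dir() else url
```

On a machine without the mirror directory the original URL is returned unchanged, so the same project files could be used on platforms with and without connectivity.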