lindell / multi-gitter

Update multiple repositories with one command
Apache License 2.0

Support caching repositories #235

Open · japborst opened 2 years ago

japborst commented 2 years ago

Hello!

When using multi-gitter I noticed that on every run the respective repos are always pulled.

It would be great if this could be cached, to avoid long wait times to pull many repositories (especially when the entire org is specified).

lindell commented 2 years ago

I think it would be useful if it does not create too much trouble for the user. How do you imagine this working? 😄 Should the user set a cache timeout themselves, and, if they enable caching, expect errors such as merge conflicts that they have to deal with manually?

Stephan202 commented 2 years ago

@lindell admittedly I haven't thought deeply about this yet, but a first version could implement an algorithm such as the following, given a $CACHE_ROOT directory (I'm assuming GitHub terminology):

I suppose there should also be a --trust-cached-repositories flag (better name TBD), so that during rapid prototyping the user can iterate on the script passed to multi-gitter run without incurring any IO overhead.
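The cache lookup described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the `$CACHE_ROOT/<owner>/<repo>` layout, and the `plan_update`/`update_repo` helper names, are hypothetical, not part of multi-gitter.

```python
import subprocess
from pathlib import Path


def plan_update(cache_root: str, owner: str, repo: str) -> tuple[str, Path]:
    """Decide whether a repository must be cloned or can be refreshed.

    Returns ("clone", path) when no cached copy exists under
    $CACHE_ROOT/<owner>/<repo>, and ("fetch", path) when one does.
    """
    path = Path(cache_root) / owner / repo
    if (path / ".git").is_dir():
        return "fetch", path  # cached copy present: only fetch new objects
    return "clone", path      # no cache yet: do a full (shallow) clone


def update_repo(cache_root: str, owner: str, repo: str, url: str) -> Path:
    """Bring the cached copy of a repository up to date, cloning if needed."""
    action, path = plan_update(cache_root, owner, repo)
    if action == "clone":
        path.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", "--depth", "1", url, str(path)],
                       check=True)
    else:
        # Refresh the cached copy instead of re-cloning everything.
        subprocess.run(["git", "-C", str(path), "fetch", "--depth", "1"],
                       check=True)
    return path
```

With a `--trust-cached-repositories`-style flag, the `fetch` branch could additionally be skipped entirely, so repeated prototyping runs touch only the local cache.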

lindell commented 2 years ago

@Stephan202 So in that case, multi-gitter would still need to fetch from the remote. I guess this could speed up the process in some cases with very big repos and small changes 🤔 For those use cases it would indeed be useful.

Stephan202 commented 2 years ago

Indeed, we have a number of large repos that would benefit from this.

(Currently we have a repository containing all our other repositories as submodules, with various operations performed using git submodule foreach. This can be a bit unwieldy, but does have the benefit of repository state updates being decoupled from modification operations, which avoids extensive waiting between trials, even when on a slow network.)

japborst commented 2 years ago

To give a little more flavour to the size of the problem: in our case (and, I imagine, at many other companies) running multi-gitter against the entire GitHub org means cloning hundreds of repos. Even using the default depth of 1, that still means fetching anywhere from a few MB up to, in the worst case, a GB per repository.

lindell commented 2 years ago

I do agree that this is something that should be added! I will not have the time to look at this any time soon, but if you add it and create a PR, I'm happy to merge it 🙂