seeraven / gitcache

Local cache for git repositories to speed up working with large repositories and multiple clones.
BSD 3-Clause "New" or "Revised" License

Regex to match git URLs for caching #7

Closed. jaw closed this issue 2 years ago

jaw commented 2 years ago

We have a setup with multiple submodules (recursive) on an internal gitlab server. Then those in turn load in submodules from github, usually libraries. Those don't have to be updated very often while the local ones have to be updated on every build.

So it would be awesome if you could add a string or regex config parameter that the URL must match for the cache to actually do its caching, with everything else passed through.

By default this parameter could be "", aka don't care.

jaw commented 2 years ago

So in our case the config parameter would be:

OnlyCacheWhenURLContains "github.com"

or similar.

seeraven commented 2 years ago

Hi!

Good idea! I think I'll add two configuration options, one for including repositories in the cache as you described and one for excluding them. The logic would be that, to actually cache a repository during the initial clone, the include pattern must match whereas the exclude pattern must not match. Checks for operations other than clone and submodule update are not required, since gitcache checks the remote of the checked-out repository and updates the cache only if the remote is a folder within the GITCACHE_DIR.

Default configuration would be to include all and exclude nothing, so something like

[UrlPatterns]
include = .*
exclude = 

So in your case it would be something like

[UrlPatterns]
include = .*github\.com\/.*
exclude = .*
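
A minimal sketch of the include/exclude check described above, assuming Python's `re.match` and `configparser`; it is not gitcache's actual implementation. The names `should_cache` and `load_patterns` and the gitlab.example.com URL are hypothetical. One assumption worth flagging: an empty regex matches every string, so the sketch treats an empty `exclude` as "exclude nothing". By the same logic an `exclude` of `.*` would reject every URL, so the sketch leaves it empty.

```python
import configparser
import re

# Hypothetical configuration text, using the section and option names from the thread.
CONFIG_TEXT = r"""
[UrlPatterns]
include = .*github\.com\/.*
exclude =
"""


def load_patterns(text):
    """Read the include and exclude patterns, falling back to the proposed defaults."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["UrlPatterns"]
    return section.get("include", fallback=".*"), section.get("exclude", fallback="")


def should_cache(url, include, exclude):
    """Return True if the URL matches include and does not match exclude."""
    if not re.match(include, url):
        return False
    # Assumption: an empty exclude pattern means "exclude nothing",
    # because an empty regex would otherwise match every URL.
    if exclude and re.match(exclude, url):
        return False
    return True


include, exclude = load_patterns(CONFIG_TEXT)
print(should_cache("https://github.com/seeraven/gitcache", include, exclude))     # True
print(should_cache("https://gitlab.example.com/team/project", include, exclude))  # False
```
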

The question is of course whether a full regex should be used or whether simpler shell-wildcard support suffices. Regex would allow out-of-the-box matching of multiple different patterns, but is much harder to get right the first time. So an alternative would be to use the python fnmatch module, which supports '*' to match everything and '?' to match a single character. The pattern `.*github\.com\/.*` would become simply `*github.com/*`. To support multiple patterns, a `:` is probably not a good choice as it is usually part of the URL, but `;` should be quite uncommon in URLs, so multiple patterns could be specified as something like `*github.com/*;*external.com/*;https://a.single.server.com/and/this/repo`. What do you think?
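
The wildcard alternative could look roughly like the following sketch. It assumes `fnmatch.fnmatchcase` (rather than `fnmatch.fnmatch`, to avoid platform-dependent case and path normalization on URLs) and the `;` separator suggested above; the function name `matches_any` is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_any(url, patterns):
    """Return True if url matches at least one ';'-separated wildcard pattern."""
    for pattern in patterns.split(";"):
        pattern = pattern.strip()
        # Skip empty entries, e.g. from a trailing ';' or an empty setting.
        if pattern and fnmatchcase(url, pattern):
            return True
    return False


PATTERNS = "*github.com/*;*external.com/*;https://a.single.server.com/and/this/repo"

print(matches_any("https://github.com/seeraven/gitcache", PATTERNS))         # True
print(matches_any("https://a.single.server.com/and/this/repo", PATTERNS))    # True
print(matches_any("https://a.single.server.com/and/other/repo", PATTERNS))   # False
```

A pattern without wildcards, like the last one, only matches that exact URL.
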

jaw commented 2 years ago

I don't see a problem with regex as long as one gives some examples of how to match github and maybe a certain project on github as example 2.

Another option would be an array, like:

[UrlPatterns]
include = [".*github\.com\/.*", ".*bitbucket\.com\/.*"]
exclude = [".*"]

But you could go with 4 parameters, 2 easy and 2 regex:

[UrlPatterns]
include = ["github.com"]
exclude = [".*"]
include_regex = []
exclude_regex = []

If arrays are empty, don't do anything?
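
One possible reading of the four-parameter idea, as a sketch only: the "easy" lists are treated as plain substrings, the regex lists use `re.search`, empty lists are skipped, and a URL with no include lists at all is included ("don't care"). These semantics and the name `should_cache_lists` are assumptions, not something settled in this thread.

```python
import re


def _hits(url, substrings, regexes):
    """Return True if any substring occurs in url or any regex matches it."""
    return any(s in url for s in substrings) or any(re.search(r, url) for r in regexes)


def should_cache_lists(url, include=(), exclude=(), include_regex=(), exclude_regex=()):
    """Decide whether to cache url based on substring and regex lists."""
    # Assumption: with both include lists empty, every URL is included.
    if (include or include_regex) and not _hits(url, include, include_regex):
        return False
    # Empty exclude lists never hit, so they exclude nothing.
    return not _hits(url, exclude, exclude_regex)


print(should_cache_lists("https://github.com/a/b", include=["github.com"]))          # True
print(should_cache_lists("https://gitlab.example.com/a/b", include=["github.com"]))  # False
```
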

seeraven commented 2 years ago

I've found a little time today to work on this feature (commit https://github.com/seeraven/gitcache/commit/33d4e4a359c0cbf54747e08cf3b1957067be4009). It is not finished yet. At the moment, only the clone command is handled, and the functional tests must be extended too. But you should be able to see the gist. ;-)

seeraven commented 2 years ago

Hi! I've just merged the changes and created a new release. It would be cool if you could test it in your specific scenario and give feedback on how it goes.

jaw commented 2 years ago

Yes, will do, I'll do #8 first and then look at this.

jaw commented 2 years ago

I've confirmed that this works well too, closing! Awesome!