seeraven / gitcache

Local cache for git repositories to speed up working with large repositories and multiple clones.
BSD 3-Clause "New" or "Revised" License
37 stars 6 forks source link

Fix resolving relative submodule urls for all cases #26

Closed Youw closed 7 months ago

Youw commented 7 months ago

Resolves: #24

seeraven commented 7 months ago

Thank you for your contribution! I've merged it and will add tests for it.

Youw commented 7 months ago

Apparently, I've used some features from python 3.9+ (e.g. removeprefix), which breaks runtime on Ubuntu20.04 with python 3.8. Working on a fix.

seeraven commented 7 months ago

I've added now a unit test and functional tests to master. You can run the unit test explicitly by calling

UNITTESTS="GitCacheResolveSubmoduleUrlTest" make unittests

They are currently failing. :-( For once, the repos can be specified with a trailing / (the remote_url as well as the submodule_url). You already had a part in it that removes the trailing slash for the remote_url, but the rstrip method is not in-place, so that had no effect. ;-)

The other problem is that there is no path normalization, so intermediate parts like in ../not_here/../but_here are not normalized. This will probably work with git but the internal database does not normalize the paths and that would result in two entries for on repo.

Thanks again for your effort!

Youw commented 7 months ago

The other problem is that there is no path normalization

That is done on purpose. I take the behavior of git itself as a reference. E.g. if a repo is cloned as https://github.com/user/foo.git and a submodule is ../a/../bar.git, running git submodule init would produce something like 'bar' is registered with url https://github.com/user/a/../bar.git, i.e. only relative parts at the beginning are being resolved.

Youw commented 7 months ago

https://github.com/user/a/../bar.git

Even though I'm not aware of such valid urls, I'm just trying to preserve same behavior as git would do it on its own.

Youw commented 7 months ago

For once, the repos can be specified with a trailing / (the remote_url as well as the submodule_url)

Right. Fixed trailing '/' for remote_url. As for trailing '/' for submodule_url (oroginal or "resolved") - I'd keep things as is as well. Again, to match the behavior of the original git tool.

seeraven commented 7 months ago

Yeah git is quite lazy in that respect and relies on the server to normalize the path. That is even true with the trailing / on the URLs. git does not need to know whether github.com/seeraven/gitcache.git and `github.com/seeraven/something/stupid/../../gitcache.git/" point to the same repository, but a cache would like to know that it is the same repository and not something different.

Youw commented 7 months ago

but a cache would like to know that it is the same repository and not something different.

Sounds like best if it would be a responsibility of get_mirror_path function. In which case we can 100% match the behavior of data fetching with the original git tool, and "optimize" the cache path.

seeraven commented 7 months ago

You're reading my mind - I'm currently working on it. ;-) So then we use the new resolve_submodule_url function to generate the target path like git would do and when gitcache is called again for cloning the path is normalized.