Problem
To reduce the size of stored data, we save downloaded repositories as separate rooted repos. This saves a lot of space but makes them harder to work with:
A single repository with several unconnected history trees is split across several rooted repositories. To go over all its data we have to iterate over several locations, which may not be on the same computer or disk.
Updating a repository means separating its objects by rooted repo. This process is very time consuming. It also discards the work done on the server side, since the packfile sent by the server has to be repacked.
Solution
Instead of splitting each repository into its several rooted repo components, store all of them in a single one. To pick the rooted repo where the repository will be stored, we can use the init commit of the repository's default branch, and always use that same rooted repo for updates. This mapping should be stored somewhere, such as a database or an ad-hoc index. The name of this meta repository will be the hash of the init commit of the default branch at the time the repository is first downloaded.
Boils down to:
First time: find the init commit for the default branch
Updates: use the rooted repository written in the database
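The two cases above can be sketched as follows. This is only an illustration, not the borges implementation: the commit graph is a plain dictionary, the init commit is assumed to be the one reached by following first parents, and RootedRepoIndex is a hypothetical stand-in for the database or ad-hoc index.

```python
from typing import Dict, List

# Hypothetical in-memory commit graph: commit hash -> list of parent hashes.
CommitGraph = Dict[str, List[str]]


def find_init_commit(graph: CommitGraph, head: str) -> str:
    """Follow first parents from head until a commit without parents is found."""
    current = head
    seen = set()
    while graph[current]:
        if current in seen:
            raise ValueError("cycle in commit graph")
        seen.add(current)
        current = graph[current][0]
    return current


class RootedRepoIndex:
    """Stand-in for the database: repository ID -> rooted repo name."""

    def __init__(self) -> None:
        self._by_repo: Dict[str, str] = {}

    def resolve(self, repo_id: str, graph: CommitGraph, default_head: str) -> str:
        stored = self._by_repo.get(repo_id)
        if stored is not None:
            # Updates: always reuse the rooted repo chosen the first time.
            return stored
        # First time: the init commit of the default branch names the rooted repo.
        init = find_init_commit(graph, default_head)
        self._by_repo[repo_id] = init
        return init
```

Because the stored name wins over a fresh calculation, the repository stays in the same rooted repo even if its default branch later changes.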
Advantages
Downloading and updating a repository can reuse the packfile sent by the server instead of repacking. This should be much faster than the current system.
It eases an optimization to download only new objects, as the whole repository is in a single place.
It is much easier to extract information about the whole repository, as there is no need to query several pieces.
There may be unforeseen advantages for repositories where some files are repeated in unrelated history trees, for example image files present in both master and gh-pages.
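The last point follows from how git addresses objects: a blob's ID depends only on its content, so the same image committed in two unrelated history trees is a single object once both trees live in the same repository. A minimal sketch of the blob ID calculation (git_blob_id is a name chosen here for illustration):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The same bytes give the same ID regardless of branch or path.
logo_in_master = b"\x89PNG fake image bytes"
logo_in_gh_pages = b"\x89PNG fake image bytes"
same_object = git_blob_id(logo_in_master) == git_blob_id(logo_in_gh_pages)
```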
Disadvantages
There may be repositories containing a history tree that is not reachable from the default branch and is already stored in another rooted repository. Such a tree is stored twice, which consumes more space.
The PGA will have a different format. It will be easier to query, but the tools and documentation have to be changed.
Rooted repos split some big repositories into slices. This had the side effect of creating several partitions per repository, which enabled parallelism in gitbase. With a single repository that parallelism is lost.
Putting all repositories and branches in the same repository will make them bigger. This may carry a performance penalty for some big repos: big indexes, a huge number of references, lots of objects, etc.
A database is needed to persist the name of the repository where each one is stored, and it has to be changed to hold that name. The current system doesn't really need a database, as the init commits of the rooted repos are always calculated.
Migration path
We can change the way we store the repositories and continue downloading. It won't affect the queries we can run now. Eventually, as the repositories are updated, all their objects will be packed together in the same repository. To delete dangling rooted repo files, we can use borges_tool to find the siva files not mentioned in the database.
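Finding the dangling files is a set difference between the siva files on disk and the rooted repos referenced in the database. The sketch below does not use borges_tool; it assumes that each siva file is named after its init commit hash (<hash>.siva) and that the referenced hashes have already been read from the database. dangling_siva_files is a hypothetical helper.

```python
from pathlib import PurePath
from typing import Iterable, List, Set


def dangling_siva_files(siva_paths: Iterable[str], referenced: Set[str]) -> List[str]:
    """Return the siva files whose init hash is not mentioned in the database."""
    dangling = []
    for path in siva_paths:
        p = PurePath(path)
        # Assumption: the file name without extension is the init commit hash.
        if p.suffix == ".siva" and p.stem not in referenced:
            dangling.append(path)
    return sorted(dangling)
```

The returned paths are only candidates for deletion; checking them before removing anything is advisable while downloads are still running.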
The proposed solution is similar to what GitHub uses for forks (https://githubengineering.com/counting-objects/#your-very-own-fork-of-rails).