src-d / borges

borges collects and stores Git repositories.
https://docs.sourced.tech/borges/
GNU General Public License v3.0

Change the way repositories are downloaded #380

Open jfontan opened 5 years ago

jfontan commented 5 years ago

Problem

To reduce the size of the stored data, we split each downloaded repository into separate rooted repos. This saves a lot of space but makes the repositories harder to work with:

Solution

Instead of splitting each repository into its several rooted repo components, store all of its objects in a single one of them. To pick the rooted repo where a repository will be stored, we can use the init commit of the repository's default branch, and then always use that same rooted repo for updates. This mapping should be stored somewhere, such as the database or an ad-hoc index. The name of this meta repository will be the init commit hash of the default branch at the time the repository is first downloaded.
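A minimal sketch of the lookup described above, to make the "pick once, then always reuse" rule concrete. It is not borges code: `RootedRepoIndex`, `resolve`, and the in-memory dict are hypothetical stand-ins for the database or ad-hoc index, and the repository IDs and hashes in the usage note are made up.

```python
from typing import Dict


class RootedRepoIndex:
    """Hypothetical stand-in for the database or ad-hoc index.

    Maps a repository ID to the name of the rooted repo that stores it.
    """

    def __init__(self) -> None:
        self._rooted_by_repo: Dict[str, str] = {}

    def resolve(self, repo_id: str, default_branch_init: str) -> str:
        """Return the rooted repo name to use for repo_id.

        On the first download the init commit hash of the default branch
        is recorded as the name. Later calls return the recorded name,
        even if the default branch (and so its init commit) has changed.
        """
        return self._rooted_by_repo.setdefault(repo_id, default_branch_init)
```

With this sketch, `resolve("github.com/a/b", "aaa")` records and returns `"aaa"`. A later `resolve("github.com/a/b", "bbb")` still returns `"aaa"`, so updates keep going to the same place. A different repository whose default branch has the same init commit resolves to the same name, so both end up in one rooted repo.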

Boils down to:

The solution is similar to what GitHub uses (https://githubengineering.com/counting-objects/#your-very-own-fork-of-rails).

Advantages

Disadvantages

Migration path

We can change the way we store the repositories and continue downloading. The change won't affect the queries we can run now. Eventually, as the repositories are updated, all of their objects will be packed together in the same rooted repository. To delete dangling rooted repo files, we can use borges_tool to find the siva files that are not mentioned in the database.
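The cleanup step amounts to a set difference between the siva files on disk and the rooted repos the database still references. The sketch below shows only that comparison. It does not use borges_tool, `dangling_siva_files` is a hypothetical helper, and it assumes each file is named `<init-hash>.siva`.

```python
from pathlib import PurePath
from typing import Iterable, List


def dangling_siva_files(siva_paths: Iterable[str],
                        referenced_inits: Iterable[str]) -> List[str]:
    """Return the siva paths whose init hash is not referenced.

    Assumes (hypothetically) that each file is named <init-hash>.siva,
    so the file stem is the rooted repo name stored in the database.
    """
    referenced = set(referenced_inits)
    return sorted(
        path for path in siva_paths
        if PurePath(path).suffix == ".siva"
        and PurePath(path).stem not in referenced
    )
```

The file list could come from something like `pathlib.Path(root).rglob("*.siva")`, and the referenced hashes from a query over the stored mapping. Listing the candidates before deleting anything keeps the cleanup easy to review.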