Closed Manishearth closed 8 months ago
Note that this is blocked on https://github.com/libgit2/libgit2/issues/2782.
And to clarify some more, I would love to implement this! The backing libgit2 support is a prerequisite, however (see https://github.com/libgit2/libgit2/issues/1430 as well)
Both of those blocked issues in libgit are now fixed :-)
Ah unfortunately they were closed in favor of a new metabug in libgit2: https://github.com/libgit2/libgit2/issues/3058 I'll update the description accordingly
Darn, got my hopes up :-)
This issue is less important to Servo now that crates.io supports larger crate sizes and we are able to shift more of our dependency graph from raw git to crates.io.
This would be nice for crates.io-index too, not just git dependencies. A current full clone of the index takes 39MB network and 48MB .git/
on disk, whereas --depth 1
is only 4.5MB network and 6MB on disk.
@cuviper oh it's worth pointing out that for the index at least we planned on this happening over time (clone being large) and we wanted to protect against that. It should be implemented that at any time we can roll the entire index into one commit (e.g. one huge squash) and force push that up. That, in effect, should reset everyone to a "shallow clone".
We've never done that yet so we don't quite have a procedure and/or threshold to do that, but I should probably verify that we can actually do that soon :). Other than that all we'd need to do is briefly turn off crates.io, do the squash, force push, and then turn crates.io back. All instances of Cargo should then pick up the new commit and hard reset to it when they next need to update the registry.
Hmm, will the history be saved to a branch before the reset? I'm not sure it's important, but it seems like we shouldn't just throw it all away. And FWIW git has a --single-branch
option (implied by --depth
) which we would want to use to get only the squashed master. It seems libgit2 can also do this by adjusting the remote refspec.
Ah excellent point, I think we'd definitely want a backup somewhere. I don't personally know a huge amount about git refspecs, but my gut is that Cargo is cloning all branches of the index today (and all historical versions of Cargo). I think it'd be easy to update that though to only fetch master
!
Yeah, the refspec just becomes "refs/heads/master:refs/remotes/origin/master"
.
I've filed an issue to update that
Hey, any changes to this? I'm depending on my own forks of glfw & yoga which I'd like to rather not publish to crates.io yet - fetching it takes considerable time.
I think the progress of this issue is completely related to https://github.com/libgit2/libgit2/pull/5254, because IIRC cargo use the git2 rust library and it has recently made progress on shallow clones.
I think the progress of this issue is completely related to libgit2/libgit2#5254, because IIRC cargo use the git2 rust library and it has recently made progress on shallow clones.
@Kerollmops
Actual, there is more than that, since GitHub limited shallow clone on large repository, I don't think this would be work even git2
support shallow clone.
There is an related RFC for pure HTTP registry index, maybe that would be a solution.
example in the wild:
downstream use case:
in nixpkgs
there is rustPlatform.fetchCargoTarball which only needs shallow clones
deep clones cannot be shared between fetches, so deep clones actually increase the server load
Shallow clones are discouraged by github as they increase load on their servers (source)
in nixpkgs
, many packages use fetchFromGitHub
which uses github's archive API for better performance than git clone
the downside of using the archive API is:
there is no simple way to verify the files by the commit hash, because the commit object is *not* included in the download (could be added as http header), and the github repos API is lossy (timezones are missing)
tree hash of files + commit object → commit hash
as workaround, fetchFromGitHub
requires an additional sha256
hash to verify files
example
c=d41f5ccd7ea91afee4f1a9d20b85dbcede135d3b
git clone --depth 10 https://github.com/rust-lang/cargo
git checkout $c
git log | head -n3 | grep ^Date
# Date: Wed Mar 2 08:41:44 2022 -0600
# -> timezone is -0600
curl https://api.github.com/repos/rust-lang/cargo/commits/$c | jq .commit.author.date
# 2022-03-02T14:41:44Z
# -> no timezone
curl -I -L https://github.com/rust-lang/cargo/archive/$c.tar.gz
# -> no commit object in http headers
(maybe someone with a better reputation than me could convince github/gitlab/gitea to implement this ...)
there is no simple way to verify the files by the commit hash
It is however possible to fetch by the commit hash, so maybe this is fine? If you ask for commit $FOO over TLS, then then the result can be trusted insofar as that github is.
It is however possible to fetch by the commit hash
yepp, but its 2x slower than github's archive API
mkdir d
cd d
git init
git remote add o https://github.com/postgres/postgres
time git fetch o 5b68f75e12831cd5b7d8058320c0ca29bbe76067 --depth 1
# real 0m13.084s
time git checkout 5b68f75e12831cd5b7d8058320c0ca29bbe76067
# real 0m3.814s
time wget https://github.com/postgres/postgres/archive/5b68f75e12831cd5b7d8058320c0ca29bbe76067.tar.gz
# real 0m6.431s
ask for commit $FOO over TLS
TLS/SSL does not help here, because the data is already content-addressed
Your example is of fetching a specific commit hash over the archive API. TLS helps because it ensures that the data is authentic and not corrupt.
There is https://github.com/Byron/gitoxide project, which aims to implement Git in pure Rust.
It seems to be quite far along and maybe can be used for at least a subset of use cases in the meantime? It is also under active development. I'd be happy to put a few bucks into development that would remove ~10 minutes from each CI run.
cc @Byron just in case.
Thanks for the CC! I was aiming at one day find a way into cargo
and created this tracking issue for additionally oxidization.
There is a lot of work still to be done, and even though gitoxide
deals with shallow histories gracefully by default and can receive packs that are shallow, there is a lot more around transport and credentials that needs to be carefully ported to achieve feature parity with git2
and support shallow clones.
That said, I also plan to apply for a Rust foundation grant this month and think that bringing early gitoxide
support for cargo
with focus on resolving this issue would be great value for everyone.
What about supporting shallow clone in conjunction with net.git-fetch-with-cli=true
at least?
More than 7 years later this issue is still blocked on libgit2, and I think it's time to change the approach from waiting to switching to something else :).
I am happy to announce that the Rust Foundation provided a grant to resolve this issue specifically by the end of this year, and I am here to kick it off by starting a conversation.
Due to the structure of the grant there already is a frame suggested by me which says that in about three months, we should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand. In my initial naivetΓ© I picked this as mid-term goal because it would require the following capabilities from gitoxide
cargo
usesDoing so seems to make sense as well as shallow vs non-shallow clones of the crates index are 5.3x faster and consume only 25% of the disk space for .git
itself (the related research is in my research). I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.
From there, the 'only' thing missing to resolve the issue of shallow clones for crates, merely from the perspective of capabilities in gitoxide
, would be to perfectly checkout and potentially update worktrees, which can be quite involved thanks to git-attributes and git-submodules.
I will be able to move freely related to the implementation of required capabilities in gitoxide
and will proceed with the implementation once the initial analysis is completed.
Integrating gitoxide
into the relevant parts of cargo
requires alignment which I hope we can drive forward here or as part of any process that may be more suited for that.
For me it would be most helpful if I could get guidance on the grants mid-term which has gitoxide
create shallow clones of the crates index. Is it feasible and if so how to best proceed from here, or what could other mid-term goals look like alternatively? I would feel comfortable the integration could start within 6 weeks for now, leaving about 6 weeks to drive it to completion with availability in nightly, maybe.
Thanks a lot for your help :)!
@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?
If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).
@Byron I am very excited for your grant! I look forward to us working together!
We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.
From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.
I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.
I believe we already do a bear clone of the index. Is that what you're referring to?
If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).
I appreciate the pressure to look at the data on this. My null hypothesis is that gitoxide
will bring improvements even if we have to have places where we fall back to libgit2
.
For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use gitoxide
to do a shallow and bare clone of dep dependencies, and then use libgit2
to actually create the working tree.
The Cargo Team should make a higher bandwidth communication channel with you to fully support the work in your grant!
@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?
It would only replace it for clones and fetches, basically all network operations and worktree updates. Indeed git2
would remain everywhere else in the codebase. I am tracking a probably incomplete list of capabilities that gitoxide
would need to support to fully replace git2
. The latter is also my goal, and I would expect this to happen in a follow-up grant that I will be applying for.
That said, if there is time in this grant period I will be happy to make use of gitoxide
in as many places where possible based on your guidance, and would expect the API to be more comfortable at least while also being faster in the common case.
And to be honest, grant or not, I know I won't rest until gitoxide
fully replaces git2
and is meaningfully 'better' in more ways than just being 'pure Rust'.
If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).
I think figuring out how to perform the initial integration where git2
and gitoxide
have to be present in parallel is part of what we would come up with before any integration work can start. The way I see it and from my experience with cargo
, both code-paths would be compiled in at all times and can be switched between. What I would want to avoid is to have to maintain a patch queue for a year until the transition is complete π
. In any case, it's probably worth collecting some data first to see if the overhead incurred to binary size and build/CI times is acceptable for the duration of the transition.
Last but not least, something I forgot to mention: I plan to maintain the gitoxide
integration until it reaches 1.0, so things like breaking changes you won't have to deal with as I will perform the upgrades.
@Eh2406
I appreciate the pressure to look at the data on this.
To clarify, I don't think a deep analysis is needed. However, in the past, we have cared about dependencies being added (e.g. clap
's derive
feature) and about consistent behavior when having two dependencies fulfilling a similar role (toml
and toml_edit
). Overall, I see this as something that both the Cargo team and Byron would work together to figure out what is the acceptable set of trade offs if we can't exclusively switch.
@Byron I am very excited for your grant! I look forward to us working together!
Thank you! And likewise :).
We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.
From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.
Before posting here I dug into the tree of links spreading out from this issue and also noticed that shallow clones and fetches were discouraged. From what I could fathom, it all boils down to this comment by a GitHub engineer 6 years ago. From my shallow (ππ) understanding of how shallow clones work I believe it's critical to assure the future fetches are correctly implemented so that the fetch will only include the changes, very similar what happens to fetches into a non-shallow repository. Besides the fetch protocol having special handling of shallow commits I don't see why the server would be slower if the client is correctly implemented. Furthermore the GitHub engineer talked about a series of patches to land that would reduce waste on the server side when handling shallow fetches, and I assume these landed by now. They also highlighted that the client was doing a non-shallow fetch after the first shallow clone which defeats the purpose.
That said, I am happy to assume shallow fetches aren't viable, but hope that the previous paragraph can be a reason to re-validate that assumption.
Using gitoxide
for clones and fetches of the crates.io index, shallow or not, still seems to be a viable mid-term goal though, and I think it's valuable as gitoxide
can resolve packs faster thus speeding up the clone.
I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.
I believe we already do a bear clone of the index. Is that what you're referring to?
Thanks for pointing that out! After removing my possibly years old index and running cargo again, it indeed created a checkout that looks more like a bare clone. The repository isn't 'officially' bare as bare = false
can be seen in the .git/config
file, but I assume cargo uses it as if it was bare anyway - in any case, there is no worktree checkout anymore.
This idea can certainly be discarded then.
For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use
gitoxide
to do a shallow and bare clone of dep dependencies, and then uselibgit2
to actually create the working tree.
Doing so could be a viable alternative mid-term goal if any work on using gitoxide
for cloning and fetching the crates index should be avoided, as it would allow skipping worktree checkouts at first. I'd even go as far as to say that gitoxide
can realistically checkout the initial clone but delegate submodule updates to git2
at first.
After diving into cargo's git related code a little more where conveniently most of the relevant one seems to be in sources/git/utils.rs
I came up with a more fine-grained and hopefully more realistic approach to tackling this issue.
The general assumption is that gitoxide
will be put behind a feature toggle (or similar), along with some options to control more precisely where shallow clones are used, i.e. crates only, or crates + index. Whenever gitoxide
can't be used due to lack of feature parity with git2
, git2
is used instead which will work just like it does before (special care will be taken to assure we don't have to repos loaded at any time, which would easily double the amount of system resources required).
There are four distinct integration steps, with the first two being sufficient to fulfill the mid term and grant goal. The last two are more like stretch goals that I did have in mind when applying for the grant and that I personally want to see realized this year.
gitoxide
for git::fetch(β¦)
resolving objects
stageshallow
fetches (and configure if this also applies to the crates index)
gitoxide
does correct and bit-for-bit-perfect worktree checkouts if compared to git
git
if this is any indicationgitoxide
can checkout submodules as well, and shallow
settings apply when fetching these
shallow
fetches, faster pack resolution and checkoutsshallow
fetches fasterHidden in the listing above is some flexibility in implementation detail as it's possible to use standard transports (git+ssh
, https
) along with standard authentication that are already implemented in gitoxide
to probably cover 90% of the cases to get to implementing shallow
fetch negotiations more quickly. Then, between 2) and 3) one could take the time to truly cover all the details of transports and credentials (for instance, git2
seems to support NTLM and kerberos, and uses libssh2
instead of the ssh
binary like gitoxide
does currently, similar to git
). This is really up to our preferences here, otherwise I'd expect transport and credentials parity to happen with 1).
Runtimes of cargo
that involve a lot of git interactions should be reduced visibly. Running the cargo benchmarks seems like a good baseline as it will clone a lot of git repositories. All of this should get much faster - 38m23s is the time to beat for a plain cargo bench
.
This write-up is the summary of my research issue and I hope it helps to come up with the next steps to take for integrating gitoxide
with cargo
. It might also a starting point for tackling other questions like:
gitoxide
code-path be considered ready for being used by default?gitoxide
to avoid having to build with gitixode
and git2
?gitoxide
be used to replace the two git
command invocations currently present? (it already could be used for both of them today)I am definitely looking forward to your feedback :), thanks a lot!
Should
gitoxide
be used to replace the twogit
command invocations currently present? (it already could be used for both of them today)
This would be nice... having executables execute other executables smells.
This would be nice... having executables execute other executables smells.
I think both invocations are well-motivated. The one in build.rs
is probably to avoid increasing build times, which is always built for the host and thus may lead to git2
/gitoxide
being built multiple times in case of --target
specifications. As git
is typically installed on a machine tasked to build cargo, it's the fastest and way to get some basic information out of a git repository. That's probably why git2
isn't used for this even though it could have been.
The other use is git gc
with the sole purpose of aggregating various packs into one pack file, which isn't easily supported by git2
. Since this is actual cargo code (as opposed to build.rs
), using gitoxide
there should be a great advantage as it would assure this maintenance happens wherever cargo is executed independently of whether git is available in the PATH or not. I imagine this positively affects windows users in particular.
What I like about this second opportunity is that it is high-value yet easy to accomplish, and all that's needed to enable a PR is a pre-existing feature toggle to allow switching gitoxide
on in the code base - it could be some sort of warm up to the bigger operation proposed here. Doing so could motivate the feature toggle that future PRs will also use (see the 4-step plan above for what these could look like).
Note that upstream shallow cloning functionality is in final review stages over at libgit2/libgit2#6396. Perhaps an interested party can implement this using a patched version + PR here so that when it lands it's ready to go?
It looks like shallow-cloning on libgit2
side has landed ππ
So this issue is actually unblocked now
I'm interested in taking on this issue as a first contribution.
My initial thoughts are to move the -Zgitoxide=shallow-index,shallow-deps
flag to a new -Zgit=shallow-index,shallow-deps
flag that applies to both gitoxide and libgit2. One pain point I've discovered in testing is that libgit2 doesn't support local shallow fetches. Otherwise, it's a pretty straightforward change.
Thanks for sharing! This reminded me to keep track of the shallow
requirement in the respective gitoxide
ticket.
Besides that, I really like the generalization of the shallow-*
feature.
Amazing, thanks everyone for getting this landed! Exciting :)
Servo has a dep on Skia, which takes a very long time to download. Aside from appearing to hang (since Cargo doesn't show a progress message), this is a rather unnecessary delay for getting a new build running.
It would be nice if we could specify a
--depth
argument in the.config
. (most of the size is due to it's extensive commit history)Alternatively, a default
--depth 50
would work too, though I'm not sure how well that would interact withcargo update
.Status
Blocked on https://github.com/libgit2/libgit2/issues/3058