rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0
12.63k stars 2.4k forks source link

Allow specifying shallow clones in .config #1171

Closed Manishearth closed 8 months ago

Manishearth commented 9 years ago

Servo has a dep on Skia, which takes a very long time to download. Aside from appearing to hang (since Cargo doesn't show a progress message), this is a rather unnecessary delay for getting a new build running.

It would be nice if we could specify a --depth argument in the .config. (most of the size is due to it's extensive commit history)

Alternatively, a default --depth 50 would work too, though I'm not sure how well that would interact with cargo update.

Status

Blocked on https://github.com/libgit2/libgit2/issues/3058

alexcrichton commented 9 years ago

Note that this is blocked on https://github.com/libgit2/libgit2/issues/2782.

alexcrichton commented 9 years ago

And to clarify some more, I would love to implement this! The backing libgit2 support is a prerequisite, however (see https://github.com/libgit2/libgit2/issues/1430 as well)

larsbergstrom commented 9 years ago

Both of those blocked issues in libgit are now fixed :-)

alexcrichton commented 9 years ago

Ah unfortunately they were closed in favor of a new metabug in libgit2: https://github.com/libgit2/libgit2/issues/3058 I'll update the description accordingly

larsbergstrom commented 9 years ago

Darn, got my hopes up :-)

larsbergstrom commented 8 years ago

This issue is less important to Servo now that crates.io supports larger crate sizes and we are able to shift more of our dependency graph from raw git to crates.io.

cuviper commented 7 years ago

This would be nice for crates.io-index too, not just git dependencies. A current full clone of the index takes 39MB network and 48MB .git/ on disk, whereas --depth 1 is only 4.5MB network and 6MB on disk.

alexcrichton commented 7 years ago

@cuviper oh it's worth pointing out that for the index at least we planned on this happening over time (clone being large) and we wanted to protect against that. It should be implemented that at any time we can roll the entire index into one commit (e.g. one huge squash) and force push that up. That, in effect, should reset everyone to a "shallow clone".

We've never done that yet so we don't quite have a procedure and/or threshold to do that, but I should probably verify that we can actually do that soon :). Other than that all we'd need to do is briefly turn off crates.io, do the squash, force push, and then turn crates.io back. All instances of Cargo should then pick up the new commit and hard reset to it when they next need to update the registry.

cuviper commented 7 years ago

Hmm, will the history be saved to a branch before the reset? I'm not sure it's important, but it seems like we shouldn't just throw it all away. And FWIW git has a --single-branch option (implied by --depth) which we would want to use to get only the squashed master. It seems libgit2 can also do this by adjusting the remote refspec.

alexcrichton commented 7 years ago

Ah excellent point, I think we'd definitely want a backup somewhere. I don't personally know a huge amount about git refspecs, but my gut is that Cargo is cloning all branches of the index today (and all historical versions of Cargo). I think it'd be easy to update that though to only fetch master!

cuviper commented 7 years ago

Yeah, the refspec just becomes "refs/heads/master:refs/remotes/origin/master".

alexcrichton commented 7 years ago

I've filed an issue to update that

cztomsik commented 4 years ago

Hey, any changes to this? I'm depending on my own forks of glfw & yoga which I'd like to rather not publish to crates.io yet - fetching it takes considerable time.

Kerollmops commented 4 years ago

I think the progress of this issue is completely related to https://github.com/libgit2/libgit2/pull/5254, because IIRC cargo use the git2 rust library and it has recently made progress on shallow clones.

DCjanus commented 2 years ago

I think the progress of this issue is completely related to libgit2/libgit2#5254, because IIRC cargo use the git2 rust library and it has recently made progress on shallow clones.

@Kerollmops

Actual, there is more than that, since GitHub limited shallow clone on large repository, I don't think this would be work even git2 support shallow clone.

There is an related RFC for pure HTTP registry index, maybe that would be a solution.

milahu commented 2 years ago

example in the wild:

downstream use case:

in nixpkgs there is rustPlatform.fetchCargoTarball which only needs shallow clones

deep clones cannot be shared between fetches, so deep clones actually increase the server load

Shallow clones are discouraged by github as they increase load on their servers (source)

in nixpkgs, many packages use fetchFromGitHub which uses github's archive API for better performance than git clone

the downside of using the archive API is:

there is no simple way to verify the files by the commit hash, because the commit object is *not* included in the download (could be added as http header), and the github repos API is lossy (timezones are missing)

tree hash of files + commit object → commit hash

as workaround, fetchFromGitHub requires an additional sha256 hash to verify files

example

c=d41f5ccd7ea91afee4f1a9d20b85dbcede135d3b

git clone --depth 10 https://github.com/rust-lang/cargo
git checkout $c
git log | head -n3 | grep ^Date
# Date:   Wed Mar 2 08:41:44 2022 -0600
# -> timezone is -0600

curl https://api.github.com/repos/rust-lang/cargo/commits/$c | jq .commit.author.date
# 2022-03-02T14:41:44Z
# -> no timezone

curl -I -L https://github.com/rust-lang/cargo/archive/$c.tar.gz
# -> no commit object in http headers

(maybe someone with a better reputation than me could convince github/gitlab/gitea to implement this ...)

Ralith commented 2 years ago

there is no simple way to verify the files by the commit hash

It is however possible to fetch by the commit hash, so maybe this is fine? If you ask for commit $FOO over TLS, then then the result can be trusted insofar as that github is.

milahu commented 2 years ago

It is however possible to fetch by the commit hash

yepp, but its 2x slower than github's archive API

mkdir d
cd d
git init
git remote add o https://github.com/postgres/postgres
time git fetch o 5b68f75e12831cd5b7d8058320c0ca29bbe76067 --depth 1
# real  0m13.084s
time git checkout 5b68f75e12831cd5b7d8058320c0ca29bbe76067
# real  0m3.814s

time wget https://github.com/postgres/postgres/archive/5b68f75e12831cd5b7d8058320c0ca29bbe76067.tar.gz 
# real  0m6.431s

ask for commit $FOO over TLS

TLS/SSL does not help here, because the data is already content-addressed

Ralith commented 2 years ago

Your example is of fetching a specific commit hash over the archive API. TLS helps because it ensures that the data is authentic and not corrupt.

nazar-pc commented 2 years ago

There is https://github.com/Byron/gitoxide project, which aims to implement Git in pure Rust.

It seems to be quite far along and maybe can be used for at least a subset of use cases in the meantime? It is also under active development. I'd be happy to put a few bucks into development that would remove ~10 minutes from each CI run.

cc @Byron just in case.

Byron commented 2 years ago

Thanks for the CC! I was aiming at one day find a way into cargo and created this tracking issue for additionally oxidization.

There is a lot of work still to be done, and even though gitoxide deals with shallow histories gracefully by default and can receive packs that are shallow, there is a lot more around transport and credentials that needs to be carefully ported to achieve feature parity with git2 and support shallow clones.

That said, I also plan to apply for a Rust foundation grant this month and think that bringing early gitoxide support for cargo with focus on resolving this issue would be great value for everyone.

nazar-pc commented 2 years ago

What about supporting shallow clone in conjunction with net.git-fetch-with-cli=true at least?

Byron commented 2 years ago

More than 7 years later this issue is still blocked on libgit2, and I think it's time to change the approach from waiting to switching to something else :).

I am happy to announce that the Rust Foundation provided a grant to resolve this issue specifically by the end of this year, and I am here to kick it off by starting a conversation.

Grant Goals and Mid-Term

Due to the structure of the grant there already is a frame suggested by me which says that in about three months, we should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand. In my initial naivetΓ© I picked this as mid-term goal because it would require the following capabilities from gitoxide

Doing so seems to make sense as well as shallow vs non-shallow clones of the crates index are 5.3x faster and consume only 25% of the disk space for .git itself (the related research is in my research). I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.

From there, the 'only' thing missing to resolve the issue of shallow clones for crates, merely from the perspective of capabilities in gitoxide, would be to perfectly checkout and potentially update worktrees, which can be quite involved thanks to git-attributes and git-submodules.

Next Steps

I will be able to move freely related to the implementation of required capabilities in gitoxide and will proceed with the implementation once the initial analysis is completed.

Integrating gitoxide into the relevant parts of cargo requires alignment which I hope we can drive forward here or as part of any process that may be more suited for that.

For me it would be most helpful if I could get guidance on the grants mid-term which has gitoxide create shallow clones of the crates index. Is it feasible and if so how to best proceed from here, or what could other mid-term goals look like alternatively? I would feel comfortable the integration could start within 6 weeks for now, leaving about 6 weeks to drive it to completion with availability in nightly, maybe.

Thanks a lot for your help :)!

epage commented 2 years ago

@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

Eh2406 commented 2 years ago

@Byron I am very excited for your grant! I look forward to us working together!

We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.

From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.

I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.

I believe we already do a bear clone of the index. Is that what you're referring to?

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

I appreciate the pressure to look at the data on this. My null hypothesis is that gitoxide will bring improvements even if we have to have places where we fall back to libgit2. For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use gitoxide to do a shallow and bare clone of dep dependencies, and then use libgit2 to actually create the working tree.

The Cargo Team should make a higher bandwidth communication channel with you to fully support the work in your grant!

Byron commented 2 years ago

@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide?

It would only replace it for clones and fetches, basically all network operations and worktree updates. Indeed git2 would remain everywhere else in the codebase. I am tracking a probably incomplete list of capabilities that gitoxide would need to support to fully replace git2. The latter is also my goal, and I would expect this to happen in a follow-up grant that I will be applying for.

That said, if there is time in this grant period I will be happy to make use of gitoxide in as many places where possible based on your guidance, and would expect the API to be more comfortable at least while also being faster in the common case.

And to be honest, grant or not, I know I won't rest until gitoxide fully replaces git2 and is meaningfully 'better' in more ways than just being 'pure Rust'.

If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc).

I think figuring out how to perform the initial integration where git2 and gitoxide have to be present in parallel is part of what we would come up with before any integration work can start. The way I see it and from my experience with cargo, both code-paths would be compiled in at all times and can be switched between. What I would want to avoid is to have to maintain a patch queue for a year until the transition is complete πŸ˜…. In any case, it's probably worth collecting some data first to see if the overhead incurred to binary size and build/CI times is acceptable for the duration of the transition.

Last but not least, something I forgot to mention: I plan to maintain the gitoxide integration until it reaches 1.0, so things like breaking changes you won't have to deal with as I will perform the upgrades.

epage commented 2 years ago

@Eh2406

I appreciate the pressure to look at the data on this.

To clarify, I don't think a deep analysis is needed. However, in the past, we have cared about dependencies being added (e.g. clap's derive feature) and about consistent behavior when having two dependencies fulfilling a similar role (toml and toml_edit). Overall, I see this as something that both the Cargo team and Byron would work together to figure out what is the acceptable set of trade offs if we can't exclusively switch.

Byron commented 2 years ago

@Byron I am very excited for your grant! I look forward to us working together!

Thank you! And likewise :).

We should be able to perform the initial clone of the crates index in a shallow fashion. Doing so is only technically related and not part of this issue at all, thus it's something we would need to sort out beforehand.

From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.

Before posting here I dug into the tree of links spreading out from this issue and also noticed that shallow clones and fetches were discouraged. From what I could fathom, it all boils down to this comment by a GitHub engineer 6 years ago. From my shallow (πŸŒπŸ‘) understanding of how shallow clones work I believe it's critical to assure the future fetches are correctly implemented so that the fetch will only include the changes, very similar what happens to fetches into a non-shallow repository. Besides the fetch protocol having special handling of shallow commits I don't see why the server would be slower if the client is correctly implemented. Furthermore the GitHub engineer talked about a series of patches to land that would reduce waste on the server side when handling shallow fetches, and I assume these landed by now. They also highlighted that the client was doing a non-shallow fetch after the first shallow clone which defeats the purpose.

That said, I am happy to assume shallow fetches aren't viable, but hope that the previous paragraph can be a reason to re-validate that assumption.

Using gitoxide for clones and fetches of the crates.io index, shallow or not, still seems to be a viable mid-term goal though, and I think it's valuable as gitoxide can resolve packs faster thus speeding up the clone.

I'd consider shallow clones of the crates index as a stepping stone for bare clones as well, which seem to be able to save an additional 700MB on disk space right off the bat and will definitely speed up initial clones further.

I believe we already do a bear clone of the index. Is that what you're referring to?

Thanks for pointing that out! After removing my possibly years old index and running cargo again, it indeed created a checkout that looks more like a bare clone. The repository isn't 'officially' bare as bare = false can be seen in the .git/config file, but I assume cargo uses it as if it was bare anyway - in any case, there is no worktree checkout anymore.

This idea can certainly be discarded then.

For example, it sounds like creating a working tree is harder than doing a clone, I wonder if we could use gitoxide to do a shallow and bare clone of dep dependencies, and then use libgit2 to actually create the working tree.

Doing so could be a viable alternative mid-term goal if any work on using gitoxide for cloning and fetching the crates index should be avoided, as it would allow skipping worktree checkouts at first. I'd even go as far as to say that gitoxide can realistically checkout the initial clone but delegate submodule updates to git2 at first.

Byron commented 2 years ago

After diving into cargo's git related code a little more where conveniently most of the relevant one seems to be in sources/git/utils.rs I came up with a more fine-grained and hopefully more realistic approach to tackling this issue.

assumptions

The general assumption is that gitoxide will be put behind a feature toggle (or similar), along with some options to control more precisely where shallow clones are used, i.e. crates only, or crates + index. Whenever gitoxide can't be used due to lack of feature parity with git2, git2 is used instead which will work just like it does before (special care will be taken to assure we don't have to repos loaded at any time, which would easily double the amount of system resources required).

integration steps

There are four distinct integration steps, with the first two being sufficient to fulfill the mid term and grant goal. The last two are more like stretch goals that I did have in mind when applying for the grant and that I personally want to see realized this year.

  1. Use gitoxide for git::fetch(…)
    • value: fetches get significantly faster in the resolving objects stage
  2. Add support for shallow fetches (and configure if this also applies to the crates index)
    • value: greatly reduced bandwidth and storage requirements for initial clones; resolve this issue.
  3. gitoxide does correct and bit-for-bit-perfect worktree checkouts if compared to git
  4. gitoxide can checkout submodules as well, and shallow settings apply when fetching these
    • value: submodules benefit from all of the above - shallow fetches, faster pack resolution and checkouts
getting to shallow fetches faster

Hidden in the listing above is some flexibility in implementation detail as it's possible to use standard transports (git+ssh, https) along with standard authentication that are already implemented in gitoxide to probably cover 90% of the cases to get to implementing shallow fetch negotiations more quickly. Then, between 2) and 3) one could take the time to truly cover all the details of transports and credentials (for instance, git2 seems to support NTLM and kerberos, and uses libssh2 instead of the ssh binary like gitoxide does currently, similar to git). This is really up to our preferences here, otherwise I'd expect transport and credentials parity to happen with 1).

measures of success

Runtimes of cargo that involve a lot of git interactions should be reduced visibly. Running the cargo benchmarks seems like a good baseline as it will clone a lot of git repositories. All of this should get much faster - 38m23s is the time to beat for a plain cargo bench.

call to action and the bigger picture

This write-up is the summary of my research issue and I hope it helps to come up with the next steps to take for integrating gitoxide with cargo. It might also a starting point for tackling other questions like:

I am definitely looking forward to your feedback :), thanks a lot!

tshepang commented 2 years ago

Should gitoxide be used to replace the two git command invocations currently present? (it already could be used for both of them today)

This would be nice... having executables execute other executables smells.

Byron commented 2 years ago

This would be nice... having executables execute other executables smells.

I think both invocations are well-motivated. The one in build.rs is probably to avoid increasing build times, which is always built for the host and thus may lead to git2/gitoxide being built multiple times in case of --target specifications. As git is typically installed on a machine tasked to build cargo, it's the fastest and way to get some basic information out of a git repository. That's probably why git2 isn't used for this even though it could have been.

The other use is git gc with the sole purpose of aggregating various packs into one pack file, which isn't easily supported by git2. Since this is actual cargo code (as opposed to build.rs), using gitoxide there should be a great advantage as it would assure this maintenance happens wherever cargo is executed independently of whether git is available in the PATH or not. I imagine this positively affects windows users in particular.

What I like about this second opportunity is that it is high-value yet easy to accomplish, and all that's needed to enable a PR is a pre-existing feature toggle to allow switching gitoxide on in the code base - it could be some sort of warm up to the bigger operation proposed here. Doing so could motivate the feature toggle that future PRs will also use (see the 4-step plan above for what these could look like).

Qix- commented 1 year ago

Note that upstream shallow cloning functionality is in final review stages over at libgit2/libgit2#6396. Perhaps an interested party can implement this using a patched version + PR here so that when it lands it's ready to go?

pwnorbitals commented 11 months ago

It looks like shallow-cloning on libgit2 side has landed πŸŽ‰πŸŽ‰ So this issue is actually unblocked now

connorworley commented 9 months ago

I'm interested in taking on this issue as a first contribution.

connorworley commented 9 months ago

My initial thoughts are to move the -Zgitoxide=shallow-index,shallow-deps flag to a new -Zgit=shallow-index,shallow-deps flag that applies to both gitoxide and libgit2. One pain point I've discovered in testing is that libgit2 doesn't support local shallow fetches. Otherwise, it's a pretty straightforward change.

Byron commented 9 months ago

Thanks for sharing! This reminded me to keep track of the shallow requirement in the respective gitoxide ticket.

Besides that, I really like the generalization of the shallow-* feature.

Qix- commented 8 months ago

Amazing, thanks everyone for getting this landed! Exciting :)