Eh2406 opened this issue 5 years ago.
The actual script:
only requires push access to the crates.io-index, which any admin of the rust-lang GitHub organization has (and probably more).
I think it'd be best to do some measurements here directly correlated with the metrics we care about. The original rationale for squashing was that initial clones took quite a long time downloading so much history. As a result I would suspect that we should establish thresholds along the lines of "how big is the download and how much would we save with a squash"?
> which any admin of the rust-lang GitHub organization has (and probably more).
I'm also able to do it, and bors of course can (dunno if bors is an admin). I think that's it though.
This was discussed at the crates.io meeting. Here were the key points.
The main unresolved questions, which we'd like to get answers from the Cargo team on, are:
My personal answers to those questions, which do not represent consensus among any team(s), are:
A follow-up to @joshtriplett's suggestion.
To clone the index as is
git clone -b master --single-branch https://github.com/rust-lang/crates.io-index.git
downloads 61.9MiB
Then fetching the squash that I made with
git fetch https://github.com/Eh2406/crates.io-index.git master
does not redownload the data!
If I delete that checkout, and clone the index from my squash
git clone -b master --single-branch https://github.com/Eh2406/crates.io-index.git
downloads 17.26MiB
So apparently we can get git to do this correctly! (Others should check if they are getting the same results.) The thing I tried https://github.com/Eh2406/crates.io-index/commit/65419fd5f5b9758b95fa08f207276639b1426e43 is to add a new squash commit on top of the existing one from last time. I did not make a script; I just did it manually. It may be sufficient to just share the same root commit, if someone wants to give that a try.
Looks like it works with the root in common, using git fetch https://github.com/Eh2406/crates.io-index.git test.
The root can be found with root=$(git rev-list --max-parents=0 HEAD)
Then the penultimate line can be new_rev=$(git commit-tree HEAD^{tree} -m "$msg" -p $root)
And everything should work.
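Putting those pieces together, a minimal sketch of the whole squash (not the actual script; `$msg` stands for whatever squash commit message is used):

```sh
# Find the original root commit so the squashed history still shares it.
root=$(git rev-list --max-parents=0 HEAD)
# Create one commit that reuses the current tree and has only the root as parent.
new_rev=$(git commit-tree 'HEAD^{tree}' -m "$msg" -p "$root")
# Point master at the new commit and (force-)push it to the index repo.
git reset --hard "$new_rev"
git push --force-with-lease origin master
```

Because the squashed commit reuses the existing tree objects and shares the original root, clients that already have the index only need to fetch the one new commit object.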
For my own personal takes on some of the unresolved questions:
> Are we ok with completely automating the whole process, and therefore losing the ability to communicate beforehand?
I don't have any problem with losing communication about this, I don't think it's really all that important especially now that it went so smoothly the first time. I do have a slightly different concern though. I think it would be a failure mode of Cargo if the index were automatically rolled up every day (defeating the purpose of delta updates), and having a fully automated process may cause us to not realize we're getting close to that situation.
I am, however, very much in favor of automation. So to allay my concern I would request that a notification of some form be sent out to interested team members when a squash happens. (aka I just want an email of some form)
> Should the threshold be time based or commit based?
I would personally measure this in megabytes of data to download rather than either metric you mentioned, but commits are likely a good proxy for the megabytes being downloaded. My ideal metric would be something like "we shave 100MB off a clean download of the index", and the 100 number there is pulled out of thin air and could be more like 50 or something like that.
> What should the threshold be?
I think the first index squash went from roughly 90MB to 10MB (ish) for a clean initial download. Along those lines I'd say that a squash should save at least 70MB before squashing.
> I think it would be a failure mode of Cargo if the index were automatically rolled up every day (defeating the purpose of delta updates)
@alexcrichton One question: if git can download a roll-up in O(delta) work, would you still think this is a failure mode?
AFAIK git just downloads objects and doesn't do any diffing at the fetch layer. Delta updates work because most indexes have a huge shared history. If we roll into one commit frequently there's no shared history so git will keep downloading the entire new history, which would be fresh each time.
So to answer your question, I don't believe git can have any sort of delta update when the history is changed and so I would still consider it a failure mode.
For users who already have the latest version of the index, Git will generally see that the tree object for the single squashed commit is identical to the tree object it already has (since it has the same hash), so it will only download the single new commit object.
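A quick way to check that claim locally (a sketch; `squashed` here is a hypothetical ref pointing at the squash commit):

```sh
# Trees are content-addressed, so identical contents produce identical hashes.
git rev-parse 'master^{tree}'
git rev-parse 'squashed^{tree}'
# If both print the same hash, fetching the squashed branch only has to transfer
# the new commit object, not the trees and blobs.
```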
So another solution may be to always keep, say, the last month's worth of commits in the history, and only squash the bits that are older than one month. All users who have updated in the month before squashing will be able to download deltas, and only users with an even older version of the index will have to redownload it in full.
When squashing the old commits, all commits on top of them will have to be rewritten, so users will have to redownload the commit objects. However, commit objects hardly contain any data, and the associated tree objects are identical, so they won't be retransmitted.
I did some experiments for this approach, and got somewhat mixed results with what Git is able to detect, but I believe it is possible to make it work. It would require some work to figure out the details, though.
We had some discussion in the crates.io Discord channel (can't figure out how to permalink it), and things aren't quite as easy as indicated in my previous comment. I may have time to do some experiments later this week, but I don't make any promises.
link to the discussion: https://discordapp.com/channels/442252698964721669/448525639469891595/597888610376613901
We did not have time to discuss this at the Cargo meeting today. So we don't have any new answers for @sgrif.
> I would request that a notification of some form be sent out
I was thinking maybe we open an issue on the index repo and have the script add a comment there; then anyone interested (in teams or not) can subscribe to that issue to get notifications. I would want to look into @Nemo157's suggestions for how to get git not to download the history at all, well before we start doing a squash every week.
> I think the first index squash went from roughly 90MB to 10MB (ish)
>git clone -b master --single-branch https://github.com/rust-lang/crates.io-index.git
...
Receiving objects: 100% (297740/297740), 67.54 MiB | 5.79 MiB/s, done.
>git clone -b master --single-branch https://github.com/smarnach/crates.io-index
Cloning into 'crates.io-index'...
...
Receiving objects: 100% (36539/36539), 14.01 MiB | 5.75 MiB/s, done.
So it looks like we save ~54 MiB today. Assuming a linear size per commit, we would hit 70 MiB saved at ~72k commits. So it looks like people's instincts are approximately in the same ballpark.
It sounds like we don't need to keep a window of commits on the main branch, and we just need to archive the squashed-away commits on an archive branch? And since the server has those available it can do deltas from those objects? That sounds perfect.
We discussed this at the Cargo meeting today.
The main unresolved questions, which we'd like to get answers from the Cargo team on, are:
- Are we ok with completely automating the whole process, and therefore losing the ability to communicate beforehand?
Yes! Several of us would like some form of notification when it happens, but it does not need to be in advance and we do not need to publicize the event.
- Should the threshold be time based or commit based?
We realized that it was hard to make a decision due to a bikeshed effect; we all had different opinions, but none strong enough to convince anyone else. So we decided: whatever is easiest for you to set up. If you need someone to make a decision, a daily check of whether we are over the commit limit.
- What should the threshold be?
After some discussion @ehuss pointed out that it is already noticeable, and @nrc pointed out that we want the script to do something the first time it runs. We don't want it to break things on some random day in 3 months when we have none of this paged in. So if it is time based, then every 6 months; if it is commit based, then 50k. Most importantly, we can monitor it and adjust the threshold later if needed.
We had some discussion of whether this will cause existing users to download the full index on each squash day. My understanding from our discussion with @Nemo157 and @smarnach on Discord is that the current plan will not trigger a full download. The GitHub repo will always have a commit referencing all the tree objects that clients already have, so GitHub will have what it needs to compute a delta even when master has just been squashed, and no git gc can remove those tree objects since they remain reachable from a backup branch. @ehuss wanted to recheck to make sure that this works as hoped.
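As a sketch of why the backup branch keeps deltas working (branch and remote names here are illustrative):

```sh
# Keep the pre-squash history reachable so the server never prunes its objects.
snapshot="snapshot-$(date +%Y-%m-%d)"
git branch "$snapshot" master
git push origin "$snapshot"
# Now force-push the squashed master; the server can still serve deltas because
# every tree object an existing client references is reachable via the snapshot.
git push --force-with-lease origin master
```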
Will move forward with a prototype that squashes when the commit count is >50k
I've been doing some tests, and Alex's original script seems to work pretty well. I've tried with a copy fetched by cargo that is anywhere from 10 to 1,000 to 10,000 commits old, and it seemed to properly download just the minimum necessary.
A fresh download (delete CARGO_HOME) from a squashed index is about a 15MB download, which uses about 16MB of disk space. Compare that to the current size which is about 73MB download using about 79MB of disk space.
The only issue I see is that for existing users, it does not release the disk usage. The only way I've determined to delete the old references is to run:
git reflog expire --expire=now --all
git gc --prune=now
Cargo currently has a heuristic where it automatically runs git gc occasionally. Perhaps it could be extended to run the above commands? It could be a big win for disk usage. What do people think?
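For anyone who wants to reclaim the space by hand today, something like this should work (assuming the default CARGO_HOME layout; the hashed directory name can differ):

```sh
# The registry index is a git checkout under CARGO_HOME.
cd ~/.cargo/registry/index/github.com-1ecc6299db9ec823
# Drop the old pre-squash refs, then prune the now-unreachable objects.
git reflog expire --expire=now --all
git gc --prune=now
```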
I'd be totally down for expanding Cargo's gc commands, and if Cargo can share indexes even across squashes that's even better!
@ehuss it looks like (https://git-scm.com/docs/git-reflog) git gc does a --expire=90days by default, and we can change the gc.reflogExpire config to set a shorter duration.
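For example (a sketch, run inside the index checkout):

```sh
# Tell future `git gc` runs to expire reflog entries after 30 days instead of 90.
git config gc.reflogExpire "30 days"
git config gc.reflogExpireUnreachable "30 days"
```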
@sgrif what is the progress on the prototype?
@sgrif this recently came up again on internals, wanted to ping again if you've got progress on a prototype?
I don't mind running the script manually nowadays one more time before we get automation set up again. If I don't hear back from you in a week or so I'll go ahead and do that and we can continue along the automation track!
Ok I briefly talked with @sgrif on IRC and the index has been squashed! We'll be sure to have automation for the next one :)
It looks like the index has grown considerably since the last squash (looks like it is 75MB now, and can be squashed down to about 20MB). @rust-lang/crates-io is there any progress on automating the process? Is there anything I can do to help? If there are barriers to setting up a cron job, can someone run the script manually?
I've re-squashed the index
When you squash the index in the future, are you able to squash, as an example, only the commits older than 1 week instead of every commit in the repo at the time it's squashed?
I only ask because I currently use the commit history as a changes feed for the crates index, and if all commits are squashed one day, I would potentially lose any changes since the last time my automated process checked the commit history. This would give me a week of buffer to run it before losing any information.
I don't think so. A commit with a long history does not have the same hash as a commit with 1 week of history. So if you only walk master, you're just going to see new commits that happen to do the same thing as the old commits but are not equal. The code to handle that may as well be code to walk the backup branches; it feels like the same level of complexity.
If you just compare the trees rather than walking commits it should work fine (e.g. from looking at the code I think crates-index-diff should work fine across a squash, and I don't recall docs.rs, which uses it, having any issues around March).
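For example, a consumer that remembers the last commit it processed can diff at the tree level, which is unaffected by rewritten history (a sketch; `$last_seen` is whatever commit hash was stored previously):

```sh
# Tree-to-tree diff between the last processed state and the current head.
# This works across a squash as long as the old commit object is still present
# in the local clone, because trees are addressed by content, not by history.
git diff --name-status "$last_seen^{tree}" 'origin/master^{tree}'
```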
Looks like it may be that time once again.
> Looks like it may be that time once again.
This was last squashed on 2020-08-04, so we will need to automate the squashing if we're looking at doing this every few months.
I've done a squash, reducing the size from 80 to 30 MB.
I would like to move this forward. The index has gotten quite large again and it takes a long time to download.
I'd like to propose running a cron job from GitHub Actions which will squash the index when it crosses a threshold (I'm proposing 50,000 commits).
Due to the way GitHub Actions cron jobs work, they can only be triggered from the default branch of a repository. We would prefer to not do that in any of the existing repos, so a new repo will need to be created to house the script.
I have created a prototype at https://github.com/ehuss/crates.io-index-squash. It contains a simple shell script which squashes the index. It runs once a day, and can be manually triggered.
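The daily check itself can stay very small; a sketch of what the workflow needs to do (the threshold and script name here are placeholders):

```sh
# Count commits on the index's master branch and squash once past the threshold.
count=$(git rev-list --count origin/master)
if [ "$count" -gt 50000 ]; then
  ./squash.sh   # hypothetical name for the squash script in the prototype repo
fi
```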
I decided to use SSH keys since their scope can be narrowed more easily than auth tokens can. It can be easily changed to an auth token if people prefer.
The steps to make this live are:

1. Create an SSH key for rust-lang/crates.io-index: ssh-keygen -t ed25519 -C "your_email@example.com". This places id_ed25519 and id_ed25519.pub in the local directory.
2. In rust-lang/crates.io-index, go to Settings > Deploy Keys, and add a new key with the contents of id_ed25519.pub.
3. In the repo containing the workflow, add a secret named CRATES_IO_SSH_KEY with the contents of id_ed25519.

@Mark-Simulacrum or @PietroAlbini, would either of you be willing to help make this happen? Or if you would prefer a different approach, I'm willing to help.
This looks reasonable to me - it may make sense to have the script and GHA config live in the simpleinfra repository, rather than a dedicated one (just for ease of having things in one place). I'm not enthusiastic about triggering it on a cronjob (once per day), but it seems OK.
I'm not sure when we'll get a chance to make this happen, so it might make sense to run the squash manually in the meantime - not sure who has the permissions to do so.
It would be good to run the numbers on how frequently we expect the squash to run - it looks like 50,000 might be a bit high perhaps? We're at ~70,000 right now, maybe we should aim for ~30,000? Ultimately this is just going to accelerate over time, I guess until we look at switching to e.g. the HTTP-based index (still in development).
Simpleinfra would be fine. My only concern is that whoever has write access to that repository will implicitly have write access to the index, and I think it would be good to keep that to a minimum. If the set of accounts that have write access to both repos is already the same, then it doesn't matter.
Here are some rough numbers:
Size | Commits | Days |
---|---|---|
144MB | 69,500 | 161 |
113MB | 50,000 | 121 |
74MB | 30,000 | 77 |
35MB | 1 |
I'd be fine with tweaking the number.
Hmm, I would prefer if this was converted to a crates.io background job rather than a workflow somewhere in a repo. Do you think that's feasible?
FWIW I think the idea of a workflow works well because the Cargo team has basically been the ones managing this and a workflow is easier to update, debug, and work with than something built-in to crates.io.
I would like to see this as a crates.io background job as well. That process already has the necessary credentials and a local clone of the index. From an ops perspective, any log output would be searchable alongside our other crates.io index operations and we can integrate it with our metrics as we build those out.
The downside is that it will take a bit more work upfront to convert the existing script into a background job. The main thing I see is the --force-with-lease= option, which there may not be bindings for (I haven't checked yet). Though that option wouldn't be strictly necessary in a background job, because the job pool already wraps the repo in a lock and ensures only 1 such job runs at a time.
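For reference, the lease-protected push looks something like this (a sketch; `$expected_head` stands for the commit the job last observed on master):

```sh
# Only force-push if remote master is still at the commit the squash was based on;
# otherwise the push is rejected instead of silently discarding new crate publishes.
git push --force-with-lease="refs/heads/master:$expected_head" origin master
```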
From past experience the Cargo team has been the one to manage this because we've been around and available to act on this. It's totally fine if others are busy, and there's no issue with that! This is why my preference is to not bake this deeply somewhere that the Cargo team doesn't understand (e.g. neither Eric nor I know what a crates.io background job is or where to even begin implementing one).
The goal here is to alleviate me from executing a personal script every few months. That's the status quo right now and it would be much better to automate.
If the crates.io/infra teams are willing to help get this all implemented in crates.io, that's great! My impression though is that y'all are quite busy with other concerns and don't have a ton of time for side issues like this. In that sense I would think that the best route is to allocate an SSH key for a repo in rust-lang which crates.io/cargo have access to which implements this cron job. In the future if crates.io or infra has the time/energy/motivation to move this to something different then it could be done then.
> The goal here is to alleviate me from executing a personal script every few months. That's the status quo right now and it would be much better to automate.
I completely agree, and I don't want medium-term goals (like tighter ops integration) to delay progress that can be made now.
> In that sense I would think that the best route is to allocate an SSH key for a repo in rust-lang which crates.io/cargo have access to which implements this cron job. In the future if crates.io or infra has the time/energy/motivation to move this to something different then it could be done then.
I don't want to dissuade anyone from working on this if they have the bandwidth to do so now. I've added this topic to the crates.io team agenda for next Friday, to verify the team does have the time, energy, and motivation to take on the responsibility of squashing the index on a regular basis. I think we could easily implement a very similar solution with something like heroku run scripts/squash-index.sh in place of GH Actions.
I think a reasonable default starting point is to squash the index on a 6 week release cycle. We typically bump our deployment toolchain within a week of a new release, and those deploys already take additional time and a few extra steps. Per the table above, if we squashed roughly every 42 days that will probably keep the index between 35MB and ~60MB, for now. Eventually we'll need to schedule it more frequently.
In the short term, I'd like to make sure I know how to run the existing script. I'll squash the staging index later tonight. If that goes well I'll squash the production index some time this weekend. (If anyone wants to run this against production sooner, that's fine too.)
After the team meeting next Friday, I'll report back to confirm if the team does have the current bandwidth to take on this responsibility. If so, and if the cargo team agrees with the plan, then we'd move forward with the first scheduled squash occurring within a week of the 1.53 release (June 17th).
(A possible outcome is that the crates.io team agrees take on the responsibility for running the squash regularly, but that the cargo team also wants the ability to trigger a squash. If that is the consensus, then a shared repo is probably the best approach for now.)
That sounds like a good plan to me, thanks for taking a look!
> but that the cargo team also wants the ability to trigger a squash
I don't think we need that. The idea with the prototype above was to try to make progress with the tools and services that are easily available to me. I was trying to make it so that it would require the absolute minimum amount of effort from other teams. Once it was set up, the intent was that only infra would have access.
I was able to run the squash against the staging index, but it looks like the push to the production index is failing because the master branch is protected. Would someone with admin access to rust-lang/crates.io-index review the protected branch settings?
To avoid persisting credentials on the local developer's machine, I'm using a slightly modified script that obtains the GIT_SSH_KEY environment variable from Heroku and temporarily adds the key to the local ssh-agent for 5 minutes.
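Roughly, the idea is something like this (a sketch, not the exact script; assumes the Heroku CLI and a running ssh-agent):

```sh
# Pull the deploy key out of the app's config and load it into the agent with a
# 5-minute lifetime, so the key is not kept on disk or in the agent for long.
key_file=$(mktemp)
chmod 600 "$key_file"
heroku config:get GIT_SSH_KEY -a crates-io > "$key_file"
ssh-add -t 300 "$key_file"
rm -f "$key_file"
```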
@jtgeibel to avoid getting the credentials out of Heroku at all, what we could do is to put the script on the crates.io repo, do a deploy and then just heroku run -a crates-io scripts/squash-index.sh.
> to avoid getting the credentials out of Heroku at all, what we could do is to put the script on the crates.io repo, do a deploy and then just heroku run -a crates-io scripts/squash-index.sh.
@pietroalbini that was my original plan, but then I remembered that the deployed slug on Heroku doesn't include source/files from the git repo. With some tweaks something like scripts/squash-index.sh > heroku run -a crates-io should work, but I expect we can have the squash integrated in the codebase by the time we want to run it again, so hopefully this is the last time we run a script like this locally.
In today's crates.io team meeting, the team agreed that in terms of workload/coordination we have no concerns with scheduling an index squash every ~6 weeks. I have an initial implementation migrating the script into a background job at rust-lang/crates.io@a7efdcdecfd633c6c1af6075f2644f592b2d6123. The main open item is working with infra to determine if we want to allow the SSH key used by the service to do a forced push to the repo or if that should be reserved for a special SSH key. Until now, the service has treated the index as fast-forward-only.
The background job to run the squash has been merged, and was just run. Squashed commit: https://github.com/rust-lang/crates.io-index/commit/3804ec0c71f6e19dacb274e07d009faf3f106882
The cargo index has been squashed again: https://github.com/rust-lang/crates.io-index/commit/8fe6ce0558479f48e4da8c6e6695f1b7bbc445d0
I've started noticing that crates.io index fetching is taking a while again on slow connections/CPUs. It looks like we're at more commits (44k) than before we last squashed (34k). Is it time to schedule a new squash?
Thanks for the reminder @adamncasey. The index has been squashed. Previous HEAD was rust-lang/crates.io-index@94b5429, now on the snapshot-2021-12-21 branch.
The index has been squashed.
Previous HEAD was ba5efd5, now on the snapshot-2022-03-02 branch. The snapshot-2021-12-21 branch has been deleted, and the new snapshot branch has been archived to the rust-lang/crates.io-index-archive repo.
@jtgeibel I was wondering if you could look at squashing again. I'm not sure if that is in a cron job or if it is still manual. It looks like it has been about 4 months since the last squash.
The index is currently 237MB which is about the largest I've ever seen it, which can take a considerable amount of time to clone and unpack.
Thanks for the ping @ehuss, invoking the squash is still manual. We still need to automate the archiving (to the archive repo) and eventual deletion of the snapshot branches (from the main repo).
Previous HEAD was 075e7a6 and is now the snapshot-2022-07-06 branch in the archive repo. I plan to remove this branch from the main repo in 7-10 days.
@jtgeibel Just checking in again to see if we can get another squash. The index is currently over 150MB and 34434 commits and takes about a minute to clone on a fast-ish system.
Previous HEAD was 31a1d8c9b1f6851c9b248813b5bb883ba5297883, now archived in the snapshot-2022-08-31 branch.
This is the next-to-smallest snapshot in terms of commits. I just deleted a temporary branch that was left behind on the main repo, so it is possible we weren't getting optimal compression server-side. I plan to remove the snapshot branch from the main repo in about 10 days.
Last (only) time, https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440, we had 100k+ commits and we thought we waited a little too long (given how smoothly it went); now we have 51k + ~1.5k/week.
The Cargo team discussed this today and we think we should do this soon. Not to interrupt whatever you are working on, but when you have a chance. Who has the permissions to run that script? Is it just @alexcrichton?
As the index grows we should have a policy for when we plan to do the squash. Once we have a policy, we should plan to make a bot to ensure we follow it. It is reasonable to say that it is too soon, or we could make a simple policy for now and grow it as we need. The Cargo team discussed a policy like "when we remember, approximately every 3-6 months" or "... approximately at 50k commits" or "... approximately when the squash is half the size of the history".