Eh2406 opened this issue 5 years ago.
The actual script:
only requires push access to the crates.io-index, which any admin of the rust-lang GitHub organization has (and probably more).
I think it'd be best to do some measurements here directly correlated with the metrics we care about. The original rationale for squashing was that initial clones took quite a long time downloading so much history. As a result I would suspect that we should establish thresholds along the lines of "how big is the download and how much would we save with a squash"?
> which any admin of the rust-lang GitHub organization has (and probably more).
I'm also able to do it, and bors of course can (dunno if bors is an admin). I think that's it though.
This was discussed at the crates.io meeting. Here were the key points.
The main unresolved questions, which we'd like to get answers from the Cargo team on, are:
My personal answers to those questions, which do not represent consensus among any team(s), are:
A follow-up to @joshtriplett's suggestion.
To clone the index as is
git clone -b master --single-branch https://github.com/rust-lang/crates.io-index.git
downloads 61.9MiB
Then fetching the squash that I made with
git fetch https://github.com/Eh2406/crates.io-index.git master
does not redownload the data!
If I delete that checkout, and clone the index from my squash
git clone -b master --single-branch https://github.com/Eh2406/crates.io-index.git
downloads 17.26MiB
So apparently we can get git to do this correctly! (Others should check if they are getting the same results.) The thing I tried https://github.com/Eh2406/crates.io-index/commit/65419fd5f5b9758b95fa08f207276639b1426e43 is to add a new squash commit on top of the existing one from last time. I did not make a script; I just did it manually. It may be sufficient to just share the same root commit, if someone wants to give that a try.
Looks like it works with the root in common, using git fetch https://github.com/Eh2406/crates.io-index.git test.
The root can be found with root=$(git rev-list --max-parents=0 HEAD)
Then the penultimate line can be new_rev=$(git commit-tree HEAD^{tree} -m "$msg" -p $root)
And everything should work.
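Putting those pieces together, a minimal sketch of the whole squash (not the actual script; `$msg` stands for whatever squash commit message is used):

```sh
# Find the original root commit so the squashed history still shares it.
root=$(git rev-list --max-parents=0 HEAD)
# Create one commit that reuses the current tree and has only the root as parent.
new_rev=$(git commit-tree 'HEAD^{tree}' -m "$msg" -p "$root")
# Point master at the new commit and (force-)push it to the index repo.
git reset --hard "$new_rev"
git push --force-with-lease origin master
```

Because the squashed commit reuses the existing tree objects and shares the original root, clients that already have the index only need to fetch the one new commit object.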
For my own personal takes on some of the unresolved questions:
> Are we ok with completely automating the whole process, and therefore losing the ability to communicate beforehand?
I don't have any problem with losing communication about this, I don't think it's really all that important especially now that it went so smoothly the first time. I do have a slightly different concern though. I think it would be a failure mode of Cargo if the index were automatically rolled up every day (defeating the purpose of delta updates), and having a fully automated process may cause us to not realize we're getting close to that situation.
I am, however, very much in favor of automation. So to allay my concern I would request that a notification of some form be sent out to interested team members when a squash happens. (aka I just want an email of some form)
> Should the threshold be time based or commit based?
I would personally measure this in megabytes of data to download rather than either metric you mentioned, but commits are likely a good proxy for the megabytes being downloaded. My ideal metric would be something like "we shave 100MB off a clean download of the index", and the 100 number there is pulled out of thin air and could be more like 50 or something like that.
> What should the threshold be?
I think the first index squash went from roughly 90MB to 10MB (ish) for a clean initial download. Along those lines I'd say that a squash should save at least 70MB before squashing.
> I think it would be a failure mode of Cargo if the index were automatically rolled up every day (defeating the purpose of delta updates)
@alexcrichton One question: if git can download a roll-up in O(delta) work, would you still think this is a failure mode?
AFAIK git just downloads objects and doesn't do any diffing at the fetch layer. Delta updates work because most indexes have a huge shared history. If we roll into one commit frequently there's no shared history so git will keep downloading the entire new history, which would be fresh each time.
So to answer your question, I don't believe git can have any sort of delta update when the history is changed and so I would still consider it a failure mode.
For users who already have the latest version of the index, Git will generally see that the tree object for the single squashed commit is identical to the tree object it already has (since it has the same hash), so it will only download the single new commit object.
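A quick way to check that claim locally (a sketch; `squashed` here is a hypothetical ref pointing at the squash commit):

```sh
# Trees are content-addressed, so identical contents produce identical hashes.
git rev-parse 'master^{tree}'
git rev-parse 'squashed^{tree}'
# If both print the same hash, fetching the squashed branch only has to transfer
# the new commit object, not the trees and blobs.
```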
So another solution may be to always keep, say, the last month's worth of commits in the history, and only squash the bits that are older than one month. All users who have updated in the month before squashing will be able to download deltas, and only users with an even older version of the index will have to redownload it in full.
When squashing the old commits, all commits on top of them will have to be rewritten, so users will have to redownload the commit objects. However, commit objects hardly contain any data, and the associated tree objects are identical, so they won't be retransmitted.
I did some experiments for this approach, and got somewhat mixed results with what Git is able to detect, but I believe it is possible to make it work. It would require some work to figure out the details, though.
We had some discussion in the crates.io Discord channel (can't figure out how to permalink it), and things aren't quite as easy as indicated in my previous comment. I may have time to do some experiments later this week, but I don't make any promises.
link to the discussion: https://discordapp.com/channels/442252698964721669/448525639469891595/597888610376613901
We did not have time to discuss this at the Cargo meeting today. So we don't have any new answers for @sgrif.
> I would request that a notification of some form be sent out
I was thinking maybe we open an issue on the index repo and have the script add a comment there; then anyone interested (in teams or not) can subscribe to that issue to get notifications. I would want to look into @Nemo157's suggestions for how to get git not to download the history at all, well before we start doing a squash every week.
> I think the first index squash went from roughly 90MB to 10MB (ish)
>git clone -b master --single-branch https://github.com/rust-lang/crates.io-index.git
...
Receiving objects: 100% (297740/297740), 67.54 MiB | 5.79 MiB/s, done.
>git clone -b master --single-branch https://github.com/smarnach/crates.io-index
Cloning into 'crates.io-index'...
...
Receiving objects: 100% (36539/36539), 14.01 MiB | 5.75 MiB/s, done.
So it looks like we save ~54 MiB today. Assuming a linear size per commit, we would hit 70 MiB saved at ~72k commits. So it looks like people's instincts are approximately in the same ballpark.
It sounds like we don't need to keep a window of commits on the main branch, and we just need to archive the squashed-away commits on an archive branch? And since the server has those available it can do deltas from those objects? That sounds perfect.
We discussed this at the Cargo meeting today.
The main unresolved questions, which we'd like to get answers from the Cargo team on, are:
- Are we ok with completely automating the whole process, and therefore losing the ability to communicate beforehand?
Yes! Several of us would like some form of notification when it happens, but it does not need to be in advance and we do not need to publicize the event.
- Should the threshold be time based or commit based?
We realized that it was hard to make a decision due to a bikeshed effect; we all had different opinions, but none strong enough to convince anyone else. So we decided: whatever is easiest for you to set up. If you need someone to make a decision, a daily check of whether we are over the commit limit.
- What should the threshold be?
After some discussion @ehuss pointed out that it is already noticeable, and @nrc pointed out that we want the script to do something the first time it runs. We don't want it to break things on some random day in 3 months when we have none of this paged in. So if it is time based, then every 6 months; if it is commit based, then 50k. Most importantly, we can monitor it and adjust the threshold later if needed.
We had some discussion of whether this will cause existing users to download the full index on each squash day. My understanding from our discussion with @Nemo157 and @smarnach on Discord is that the current plan will not trigger a full download. The GitHub repo will always have a commit referencing all the tree objects that clients already have, so GitHub will have what it needs to compute a delta even when master has just been squashed, and no git gc can remove those tree objects since they remain reachable from a backup branch. @ehuss wanted to recheck to make sure that this works as hoped.
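As a sketch of why the backup branch keeps deltas working (branch and remote names here are illustrative):

```sh
# Keep the pre-squash history reachable so the server never prunes its objects.
snapshot="snapshot-$(date +%Y-%m-%d)"
git branch "$snapshot" master
git push origin "$snapshot"
# Now force-push the squashed master; the server can still serve deltas because
# every tree object an existing client references is reachable via the snapshot.
git push --force-with-lease origin master
```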
Will move forward with a prototype that squashes when the commit count is >50k
I've been doing some tests, and Alex's original script seems to work pretty well. I've tried with a copy fetched by cargo that is anywhere from 10 to 1,000 to 10,000 commits old, and it seemed to properly download just the minimum necessary.
A fresh download (delete CARGO_HOME) from a squashed index is about a 15MB download, which uses about 16MB of disk space. Compare that to the current size which is about 73MB download using about 79MB of disk space.
The only issue I see is that for existing users, it does not release the disk usage. The only way I've determined to delete the old references is to run:
git reflog expire --expire=now --all
git gc --prune=now
Cargo currently has a heuristic where it automatically runs git gc occasionally. Perhaps it could be extended to run the above commands? It could be a big win for disk usage. What do people think?
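For anyone who wants to reclaim the space by hand today, something like this should work (assuming the default CARGO_HOME layout; the hashed directory name can differ):

```sh
# The registry index is a git checkout under CARGO_HOME.
cd ~/.cargo/registry/index/github.com-1ecc6299db9ec823
# Drop the old pre-squash refs, then prune the now-unreachable objects.
git reflog expire --expire=now --all
git gc --prune=now
```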
I'd be totally down for expanding Cargo's gc commands, and if Cargo can share indexes even across squashes that's even better!
@ehuss it looks like (https://git-scm.com/docs/git-reflog) git gc does a --expire=90days by default, and we can change the gc.reflogExpire config to set a shorter duration.
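For example (a sketch, run inside the index checkout):

```sh
# Tell future `git gc` runs to expire reflog entries after 30 days instead of 90.
git config gc.reflogExpire "30 days"
git config gc.reflogExpireUnreachable "30 days"
```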
@sgrif what is the progress on the prototype?
@sgrif this recently came up again on internals, wanted to ping again if you've got progress on a prototype?
I don't mind running the script manually nowadays one more time before we get automation set up again. If I don't hear back from you in a week or so I'll go ahead and do that and we can continue along the automation track!
Ok I briefly talked with @sgrif on IRC and the index has been squashed! We'll be sure to have automation for the next one :)
It looks like the index has grown considerably since the last squash (looks like it is 75MB now, and can be squashed down to about 20MB). @rust-lang/crates-io is there any progress on automating the process? Is there anything I can do to help? If there are barriers to setting up a cron job, can someone run the script manually?
I've re-squashed the index
When you squash the index in the future, are you able to squash, as an example, only the commits older than 1 week instead of every commit in the repo at the time it's squashed?
I only ask because I currently use the commit history as a changes feed for the crates index, and if all commits are squashed one day, I would potentially lose any changes since the last time my automated process checked the commit history. This would give me a week of buffer to run it before losing any information.
I don't think so. A commit with a long history does not have the same hash as a commit with 1 week of history. So if you only walk master, you're just going to see new commits that happen to do the same thing as the old commits but are not equal. The code to handle that may as well be code to walk the backup branches; it feels like the same level of complexity.
If you just compare the trees rather than walking commits it should work fine (e.g. from looking at the code I think crates-index-diff should work fine across a squash, and I don't recall docs.rs, which uses it, having any issues around March).
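For example, a consumer that remembers the last commit it processed can diff at the tree level, which is unaffected by rewritten history (a sketch; `$last_seen` is whatever commit hash was stored previously):

```sh
# Tree-to-tree diff between the last processed state and the current head.
# This works across a squash as long as the old commit object is still present
# in the local clone, because trees are addressed by content, not by history.
git diff --name-status "$last_seen^{tree}" 'origin/master^{tree}'
```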
Looks like it may be that time once again.
> Looks like it may be that time once again.
This was last squashed on 2020-08-04, so we will need to automate the squashing if we're looking at doing this every few months.
I've done a squash, reducing the size from 80 to 30 MB.
I would like to move this forward. The index has gotten quite large again and it takes a long time to download.
I'd like to propose running a cron job from GitHub Actions which will squash the index when it crosses a threshold (I'm proposing 50,000 commits).
Due to the way GitHub Actions cron jobs work, they can only be triggered from the default branch of a repository. We would prefer to not do that in any of the existing repos, so a new repo will need to be created to house the script.
I have created a prototype at https://github.com/ehuss/crates.io-index-squash. It contains a simple shell script which squashes the index. It runs once a day, and can be manually triggered.
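The daily check itself can stay very small; a sketch of what the workflow needs to do (the threshold and script name here are placeholders):

```sh
# Count commits on the index's master branch and squash once past the threshold.
count=$(git rev-list --count origin/master)
if [ "$count" -gt 50000 ]; then
  ./squash.sh   # hypothetical name for the squash script in the prototype repo
fi
```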
I decided to use SSH keys since their scope can be narrowed more easily than auth tokens can. It can be easily changed to an auth token if people prefer.
The steps to make this live are:

1. Create an SSH key for rust-lang/crates.io-index: ssh-keygen -t ed25519 -C "your_email@example.com". This places id_ed25519 and id_ed25519.pub in the local directory.
2. In rust-lang/crates.io-index, go to Settings > Deploy Keys, and add a new key with the contents of id_ed25519.pub.
3. In the repo containing the workflow, add a secret named CRATES_IO_SSH_KEY with the contents of id_ed25519.

@Mark-Simulacrum or @PietroAlbini, would either of you be willing to help make this happen? Or if you would prefer a different approach, I'm willing to help.
This looks reasonable to me - it may make sense to have the script and GHA config live in the simpleinfra repository, rather than a dedicated one (just for ease of having things in one place). I'm not enthusiastic about triggering it on a cronjob (once per day), but it seems OK.
I'm not sure when we'll get a chance to make this happen, so it might make sense to run the squash manually in the meantime - not sure who has the permissions to do so.
It would be good to run the numbers on how frequently we expect the squash to run - it looks like 50,000 might be a bit high perhaps? We're at ~70,000 right now, maybe we should aim for ~30,000? Ultimately this is just going to accelerate over time, I guess until we look at switching to e.g. the HTTP-based index (still in development).
Simpleinfra would be fine. My only concern is that whoever has write access to that repository will implicitly have write access to the index, and I think it would be good to keep that to a minimum. If the set of accounts that have write access to both repos is already the same, then it doesn't matter.
Here are some rough numbers:
Size | Commits | Days |
---|---|---|
144MB | 69,500 | 161 |
113MB | 50,000 | 121 |
74MB | 30,000 | 77 |
35MB | 1 |
I'd be fine with tweaking the number.
Hmm, I would prefer if this was converted to a crates.io background job rather than a workflow somewhere in a repo. Do you think that's feasible?
FWIW I think the idea of a workflow works well because the Cargo team has basically been the ones managing this and a workflow is easier to update, debug, and work with than something built-in to crates.io.
I would like to see this as a crates.io background job as well. That process already has the necessary credentials and a local clone of the index. From an ops perspective, any log output would be searchable alongside our other crates.io index operations and we can integrate it with our metrics as we build those out.
The downside is that it will take a bit more work upfront to convert the existing script into a background job. The main thing I see is the --force-with-lease= option, which there may not be bindings for (I haven't checked yet). Though that option wouldn't be strictly necessary in a background job, because the job pool already wraps the repo in a lock and ensures only 1 such job runs at a time.
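For reference, the lease-protected push looks something like this (a sketch; `$expected_head` stands for the commit the job last observed on master):

```sh
# Only force-push if remote master is still at the commit the squash was based on;
# otherwise the push is rejected instead of silently discarding new crate publishes.
git push --force-with-lease="refs/heads/master:$expected_head" origin master
```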
From past experience the Cargo team has been the one to manage this because we've been around and available to act on this. It's totally fine if others are busy, and there's no issue with that! This is why my preference is to not bake this deeply somewhere that the Cargo team doesn't understand (e.g. neither Eric nor I know what a crates.io background job is or where to even begin implementing one).
The goal here is to alleviate me from executing a personal script every few months. That's the status quo right now and it would be much better to automate.
If the crates.io/infra teams are willing to help get this all implemented in crates.io, that's great! My impression though is that y'all are quite busy with other concerns and don't have a ton of time for side issues like this. In that sense I would think that the best route is to allocate an SSH key for a repo in rust-lang which crates.io/cargo have access to which implements this cron job. In the future if crates.io or infra has the time/energy/motivation to move this to something different then it could be done then.
> The goal here is to alleviate me from executing a personal script every few months. That's the status quo right now and it would be much better to automate.
I completely agree, and I don't want medium-term goals (like tighter ops integration) to delay progress that can be made now.
> In that sense I would think that the best route is to allocate an SSH key for a repo in rust-lang which crates.io/cargo have access to which implements this cron job. In the future if crates.io or infra has the time/energy/motivation to move this to something different then it could be done then.
I don't want to dissuade anyone from working on this if they have the bandwidth to do so now. I've added this topic to the crates.io team agenda for next Friday, to verify the team does have the time, energy, and motivation to take on the responsibility of squashing the index on a regular basis. I think we could easily implement a very similar solution with something like heroku run scripts/squash-index.sh in place of GH Actions.
I think a reasonable default starting point is to squash the index on a 6 week release cycle. We typically bump our deployment toolchain within a week of a new release, and those deploys already take additional time and a few extra steps. Per the table above, if we squashed roughly every 42 days that will probably keep the index between 35MB and ~60MB, for now. Eventually we'll need to schedule it more frequently.
In the short term, I'd like to make sure I know how to run the existing script. I'll squash the staging index later tonight. If that goes well I'll squash the production index some time this weekend. (If anyone wants to run this against production sooner, that's fine too.)
After the team meeting next Friday, I'll report back to confirm if the team does have the current bandwidth to take on this responsibility. If so, and if the cargo team agrees with the plan, then we'd move forward with the first scheduled squash occurring within a week of the 1.53 release (June 17th).
(A possible outcome is that the crates.io team agrees take on the responsibility for running the squash regularly, but that the cargo team also wants the ability to trigger a squash. If that is the consensus, then a shared repo is probably the best approach for now.)
That sounds like a good plan to me, thanks for taking a look!
> but that the cargo team also wants the ability to trigger a squash
I don't think we need that. The idea with the prototype above was to try to make progress with the tools and services that are easily available to me. I was trying to make it so that it would require the absolute minimum amount of effort from other teams. Once it was set up, the intent was that only infra would have access.
I was able to run the squash against the staging index, but it looks like the push to the production index is failing because the master branch is protected. Would someone with admin access to rust-lang/crates.io-index review the protected branch settings?
To avoid persisting credentials on the local developer's machine, I'm using a slightly modified script that obtains the GIT_SSH_KEY environment variable from Heroku and temporarily adds the key to the local ssh-agent for 5 minutes.
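Roughly, the idea is something like this (a sketch, not the exact script; assumes the Heroku CLI and a running ssh-agent):

```sh
# Pull the deploy key out of the app's config and load it into the agent with a
# 5-minute lifetime, so the key is not kept on disk or in the agent for long.
key_file=$(mktemp)
chmod 600 "$key_file"
heroku config:get GIT_SSH_KEY -a crates-io > "$key_file"
ssh-add -t 300 "$key_file"
rm -f "$key_file"
```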
@jtgeibel to avoid getting the credentials out of Heroku at all, what we could do is to put the script on the crates.io repo, do a deploy and then just heroku run -a crates-io scripts/squash-index.sh.
> to avoid getting the credentials out of Heroku at all, what we could do is to put the script on the crates.io repo, do a deploy and then just heroku run -a crates-io scripts/squash-index.sh.
@pietroalbini that was my original plan, but then I remembered that the deployed slug on Heroku doesn't include source/files from the git repo. With some tweaks something like scripts/squash-index.sh > heroku run -a crates-io should work, but I expect we can have the squash integrated in the codebase by the time we want to run it again, so hopefully this is the last time we run a script like this locally.
In today's crates.io team meeting, the team agreed that in terms of workload/coordination we have no concerns with scheduling an index squash every ~6 weeks. I have an initial implementation migrating the script into a background job at rust-lang/crates.io@a7efdcdecfd633c6c1af6075f2644f592b2d6123. The main open item is working with infra to determine if we want to allow the SSH key used by the service to do a forced push to the repo or if that should be reserved for a special SSH key. Until now, the service has treated the index as fast-forward-only.
The background job to run the squash has been merged, and was just run. Squashed commit: https://github.com/rust-lang/crates.io-index/commit/3804ec0c71f6e19dacb274e07d009faf3f106882
The cargo index has been squashed again: https://github.com/rust-lang/crates.io-index/commit/8fe6ce0558479f48e4da8c6e6695f1b7bbc445d0
I've started noticing that crates.io index fetching is taking a while again on slow connections/CPUs. It looks like we're at more commits (44k) than before we last squashed (34k). Is it time to schedule a new squash?
Thanks for the reminder @adamncasey. The index has been squashed. Previous HEAD was rust-lang/crates.io-index@94b5429, now on the snapshot-2021-12-21 branch.
The index has been squashed.
Previous HEAD was ba5efd5, now on the snapshot-2022-03-02 branch. The snapshot-2021-12-21 branch has been deleted, and the new snapshot branch has been archived to the rust-lang/crates.io-index-archive repo.
@jtgeibel I was wondering if you could look at squashing again. I'm not sure if that is in a cron job or if it is still manual. It looks like it has been about 4 months since the last squash.
The index is currently 237MB which is about the largest I've ever seen it, which can take a considerable amount of time to clone and unpack.
Thanks for the ping @ehuss, invoking the squash is still manual. We still need to automate the archiving (to the archive repo) and eventual deletion of the snapshot branches (from the main repo).
Previous HEAD was 075e7a6 and is now the snapshot-2022-07-06 branch in the archive repo. I plan to remove this branch from the main repo in 7-10 days.
@jtgeibel Just checking in again to see if we can get another squash. The index is currently over 150MB and 34434 commits and takes about a minute to clone on a fast-ish system.
Previous HEAD was 31a1d8c9b1f6851c9b248813b5bb883ba5297883, now archived in the snapshot-2022-08-31 branch.
This is the next-to-smallest snapshot in terms of commits. I just deleted a temporary branch that was left behind on the main repo, so it is possible we weren't getting optimal compression server-side. I plan to remove the snapshot branch from the main repo in about 10 days.
Last (only) time, https://internals.rust-lang.org/t/cargos-crate-index-upcoming-squash-into-one-commit/8440, we had 100k+ commits and we thought we waited a little too long (given how smoothly it went); now we have 51k + ~1.5k/week.
The Cargo team discussed this today and we think we should do this soon. Not to interrupt whatever you are working on, but when you have a chance. Who has the permissions to run that script? Is it just @alexcrichton?
As the index grows we should have a policy for when we plan to do the squash. Once we have a policy, we should plan to make a bot to ensure we follow it. It is reasonable to say that it is too soon, or we could make a simple policy for now and grow it as we need. The Cargo team discussed a policy like "when we remember, approximately every 3-6 months" or "... approximately at 50k commits" or "... approximately when the squash is half the size of the history".