rust-lang / infra-team

Coordination repository for the Rust infra team
https://www.rust-lang.org/governance/teams/infra
Apache License 2.0

Back up Rust releases and crates #122

Open jdno opened 2 months ago

jdno commented 2 months ago

Currently, all of Rust's releases and all crates are stored on AWS. While we have multiple measures in place to prevent accidental deletion of releases or crates, e.g. bucket replication to a different region and restricted access, our current setup does not sufficiently protect us against a few threats, in particular a compromise of the AWS account itself.

Therefore, we want to set up automated out-of-band backups for both Rust releases and crates. These backups will be hosted on GCP with completely separate access controls from AWS. Specifically, none of the current infra-admins should have access to this separate environment, to protect against an account compromise.

Tasks

MarcoIeni commented 1 month ago

Proposed solution

Execution plan

  1. With Terraform, create the project where we want to store the objects and an empty Google Cloud Storage bucket called crates-io. Set the storage class to "archive".
  2. With Terraform, create a Storage Transfer job for the crates-io bucket, using the CloudFront domain cloudfront-static.crates.io as the source. This step is documented in the s3-cloudfront docs. Select "Run every day" as the scheduling option (let me know if you prefer hourly or weekly).
  3. For one week, monitor:
    • price: is it what we expected?
    • correctness: is the google bucket up-to-date with the S3 one?
  4. Do the same for the releases, i.e. static.rust-lang.org.
  5. Find a "monitoring system", that can range from "login every week and manually check that everything is ok" to "configure an alert if the transfer job fail".

FAQ

Does Storage Transfer support AWS S3?

Yes. As you can see here, Amazon S3 is supported as a source. Plus, it does not require agents or agent pools.

How much does everything cost?

TL;DR we should only pay for the Object Storage cost.

The Storage Transfer pricing explicitly says "No charges" for Agentless transfers. So traffic for Amazon S3 should be free.

Transferring from AWS CloudFront instead of S3 directly reduces the AWS egress cost. This cost should be negligible with respect to the usual crates.io and releases traffic.

The cost of Object Storage depends on the storage class. The cost calculator is here.

Here's an estimate. In the pricing calculator, set "Class A" to the number computed below:

// `published`: number of crates users publish every month
// `readme_percentage`: fraction of published crates that include a readme
fn class_a_operations(published: u64, readme_percentage: f64) -> u64 {
    // each `cargo publish` uploads a ".crate" file
    let crate_files = published;
    // each `cargo publish` also updates the RSS (XML) feed
    let corresponding_rss = published;
    // each `cargo publish` with a readme renders it for display on crates.io
    let readmes = (published as f64 * readme_percentage).round() as u64;
    crate_files + corresponding_rss + readmes
}

We can reduce the cost of this bucket by not storing the readmes and RSS files.

It is important to estimate the number of published crates because, if it's very high, "coldline storage" is cheaper than "archive storage" (try it yourself in the pricing calculator).

From my understanding "Class A" doesn't increases much for releases, because they only happen every rust release. Instead users publishing crates are way more frequent.

Can we backup only some paths of the bucket?

If we don't want to back up the entire bucket, we can use filters, which are supported in agentless transfers.

However, I'm not sure if this works with CloudFront. Maybe we can just give the URL path we want to back up? E.g. static.crates.io/crates/.

Anyway, this shouldn't be necessary because we probably want to back up everything in the buckets (unless we realize we could save a lot of money by not backing up readmes and RSS files).

Do we need a multi-region backup for the object storage?

No. Multi-region only helps if we want to serve this data in real time and have a fallback mechanism if a GCP region fails. We only need this object storage for backup purposes, so we don't need to pay double 👍

Questions

GCP region

Do you have a preference? Let's use the cost calculator to pick one of the cheapest regions 👍

Manual test

Should we add a step 0 where we test step 2 in a dummy GCP project, without Terraform? Just to validate our assumptions.

CDN for releases

I didn't put cloudfront- as a prefix for static.rust-lang.org because, from my understanding, we only serve releases via CloudFront, right?

Buckets

Useful docs

jdno commented 1 month ago

The Storage Transfer pricing explicitly says "No charges" for Agentless transfers. So traffic for Amazon S3 should be free.

Transferring from AWS CloudFront instead of S3 directly reduces the AWS egress cost. This cost should be negligible with respect to the usual crates.io and releases traffic.

These two statements seem to contradict each other. But I agree that, if we go through CloudFront, the egress costs for the backups should be absolutely negligible compared to our usual traffic. Even a full one-time backup of both releases and crates should only be in the region of ~90TB, which is marginal compared to our overall monthly volume.

Does anybody have the number of published crates every month?

This number should be easy to get from either the crates.io team or the Foundation's security engineer.

From my understanding "Class A" doesn't increases much for releases, because they only happen every rust release.

Nightly releases are published every day. 😉 But the number of files created each day is still way lower than on crates.io.

unless we realize we could save a lot of money by not backing up readmes and RSS files

I would approach this from the perspective of "what do we need to back up", and then figure out how we pay for it afterwards. Given that this is intended as a backup for disaster recovery, we have a strong argument for finding the necessary budget.

jdno commented 1 month ago

A question that hasn't been addressed yet is the different access controls for the original files in AWS and the backups in GCP. Who will have access? How do we deal with any issues that monitoring might surface? Who will be able to investigate and resolve them?

MarcoIeni commented 1 month ago

These two statements seem to contradict each other

Let me clarify: traffic for Amazon S3 should be free on the GCP bill. We still pay egress costs on the AWS bill 👍

Given that this is intended as a backup for disaster recovery, we have a strong argument to find the necessary budget.

Agree πŸ‘ I need to ask to the Foundation's security engineer about this.

MarcoIeni commented 1 month ago

Does anybody have the number of published crates every month?

I got an answer here. It's ~1k per day, so the "Class A" cost is negligible 👍
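
For a rough sense of scale, plugging that number into the class_a_operations estimate sketched earlier, with readme_percentage = 1.0 as a deliberate upper bound:

// ~1k publishes per day, ~30k per month; assume every publish also has a readme
let published_per_month: u64 = 1_000 * 30;
let class_a = class_a_operations(published_per_month, 1.0);
assert_eq!(class_a, 90_000); // well under 100k Class A operations per month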

EDIT:

Bonus: try Terraform Cloud or Pulumi if I feel like it.

MarcoIeni commented 2 weeks ago

Tasks:

Execution plan from hackmd:

  1. In the simpleinfra repo, with Terraform, create the project where we want to store the objects and an empty Google Cloud Storage bucket called crates-io. Set the storage class to "archive".
  2. Create a Storage Transfer job for the crates-io bucket, using the CloudFront domain cloudfront-static.crates.io as the source. This step is documented in the s3-cloudfront docs (however, we want to do this in Terraform). Select "Run every day" as the scheduling option (hourly or weekly are also options).
  3. For one week, monitor:
    • price: is it what we expected?
    • correctness: is the google bucket up-to-date with the S3 one?
  4. Do the same for the other buckets.
  5. The infra team works on a "monitoring system".
    • Initially, it's "login every week and manually check that everything is ok", i.e.:
      • Ensure that the number of files and the total size of the GCP buckets match the respective AWS buckets by looking at the metrics
      • Ensure that only the authorized people have access to the account
    • Later we can prioritize "configuring an alert if the transfer job fails". E.g. we could create an alert in Datadog.
  6. Run the following test (a rough command-line sketch follows this list):
    • Upload a file in an AWS S3 bucket and check that it appears in GCP.
    • Edit the file in AWS and check that you can recover the previous version from GCP.
    • Delete the file in AWS and check that you can recover all previous versions from GCP.
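
A rough sketch of that test, assuming Object Versioning is enabled on the GCS bucket so that noncurrent versions are retained; the bucket, path, and GENERATION values are placeholders:

use std::process::Command;

// run a CLI command and fail loudly if it errors
fn run(cmd: &str, args: &[&str]) {
    let status = Command::new(cmd).args(args).status().expect("failed to run command");
    assert!(status.success(), "{cmd} {args:?} failed");
}

fn main() {
    // 1. upload a test file to the S3 bucket, then wait for the next transfer run
    run("aws", &["s3", "cp", "test.txt", "s3://crates-io/backup-test/test.txt"]);
    // 2. check that it shows up in the GCP bucket
    run("gsutil", &["ls", "gs://crates-io/backup-test/"]);
    // 3. after editing or deleting the file in AWS and re-running the transfer,
    //    list all versions in GCS (noncurrent versions carry a #<generation> suffix)
    run("gsutil", &["ls", "-a", "gs://crates-io/backup-test/test.txt"]);
    // 4. restore a specific generation locally
    run("gsutil", &["cp", "gs://crates-io/backup-test/test.txt#GENERATION", "restored.txt"]);
}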

Non-blocking questions to be answered before closing this task: