rust-lang / simpleinfra

Rust Infrastructure automation

Migrate docs.rs to RDS and ECS #353

Open jdno opened 11 months ago

jdno commented 11 months ago

Questions

syphar commented 11 months ago

another thing we need to figure out:

syphar commented 11 months ago

after checking our NGINX config there is a second piece we need to solve somehow:

IP blocks.

Every now and then we have a misbehaving crawler, and in those cases we have blocked the source IP in NGINX on our server.

I would prefer to have this in AWS / CloudFront if possible.

Otherwise we would add this to our web container, probably configured via an environment variable?
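
A minimal sketch of what an env-var-driven blocklist in the web container could look like; the variable name `DOCSRS_BLOCKED_IPS` and the comma-separated format are made up for illustration, not an existing docs.rs setting:

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// Read a comma-separated IP blocklist from an environment variable.
/// (Hypothetical variable name and format, for illustration only.)
fn blocked_ips() -> HashSet<IpAddr> {
    std::env::var("DOCSRS_BLOCKED_IPS")
        .unwrap_or_default()
        .split(',')
        .filter_map(|s| s.trim().parse().ok())
        .collect()
}

/// The web handler would check the client address before serving a request.
fn is_blocked(client_ip: IpAddr, blocklist: &HashSet<IpAddr>) -> bool {
    blocklist.contains(&client_ip)
}

fn main() {
    let blocklist = blocked_ips();
    let client: IpAddr = "203.0.113.7".parse().unwrap();
    println!("blocked: {}", is_blocked(client, &blocklist));
}
```

The downside compared to blocking at the CDN/WAF level is that blocked requests still reach the container, and changing the list means redeploying or restarting it.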

syphar commented 11 months ago

Next piece we need before prod:

Access to logs

jdno commented 11 months ago

For blocking IPs, we should just set up a web-application firewall (AWS WAF). I actually think that we already have one set up for docs.rs, but I'm not 100% sure.

Access to the logs is a good point! It probably makes sense to stream all logs to a central place, whether that's CloudWatch or an external tool like Datadog.

meysam81 commented 10 months ago

@jdno Please let me know if you need a hand with any of the items in this list 🙂

syphar commented 9 months ago

@jdno coming from this discussion I want to add here that the docs.rs containers / servers should not be reachable directly from the internet, so all traffic needs to go through CloudFront & AWS WAF.

syphar commented 6 months ago

One thought I had while thinking about this topic again:

from https://github.com/rust-lang/docs.rs/issues/1871#issuecomment-1268744723

Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded growth land. Of course, that's based on a significant assumption about how long an invalidation takes.

I'm not sure if we can / should handle invalidations differently, but we might think about using Fastly when we rework the infra?
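
For reference, here is the arithmetic from the quoted estimate worked through; all numbers (15 parallel invalidations, 5 minutes each) are the assumptions from that comment, not measured values:

```rust
// Back-of-the-envelope invalidation throughput, using the assumed numbers above.
fn main() {
    let parallel_invalidations = 15; // assumed concurrent invalidations allowed
    let minutes_per_invalidation = 5; // assumed duration of a single invalidation
    let minutes_per_day = 24 * 60; // 1440

    // 15 slots that each free up every 5 minutes => 3 invalidations per minute.
    let per_minute = parallel_invalidations / minutes_per_invalidation;
    // 3 per minute over a whole day => 4320 per day,
    // compared with roughly 600 releases per day.
    let per_day = per_minute * minutes_per_day;

    println!("{per_minute} invalidations/minute, {per_day}/day");
}
```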

Mark-Simulacrum commented 6 months ago

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit. Not sure how that would affect cache hit rates, but I'd expect designing around not needing invalidations or being ok with fairly blanket invalidations to be a good long-term strategy.

(I think we've had this conversation elsewhere before).

syphar commented 6 months ago

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit.

You mean "escalating" them, so when the queue is too long, we just convert the queue into a full purge. This would definitely work, but it would mean that the user experience (especially outside the US) is worse until the cache is filled again. Of course, this might be acceptable for us.
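
A rough sketch of such an escalation, assuming the paths are queued somewhere before being sent to CloudFront; the threshold and types are illustrative, not the actual docs.rs queue handling:

```rust
/// Collapse a long invalidation queue into a single wildcard purge.
/// (Illustrative threshold; not the real docs.rs logic.)
const MAX_QUEUED_PATHS: usize = 1_000;

fn paths_to_invalidate(queued: Vec<String>) -> Vec<String> {
    if queued.len() > MAX_QUEUED_PATHS {
        // Escalate: one "/*" invalidation instead of thousands of per-path ones.
        vec!["/*".to_string()]
    } else {
        queued
    }
}

fn main() {
    let queued: Vec<String> = (0..2_000).map(|i| format!("/crate-{i}/*")).collect();
    println!("{:?}", paths_to_invalidate(queued));
}
```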

being ok with fairly blanket invalidations

This also means that the backend always has to be capable of handling the full uncached load, plus higher egress costs depending on how often we have to do a full purge.

I also remember a discussion at EuroRust that we could think about having additional docs.rs web servers (also a read-only DB & local bucket?) in some regions (Europe?).

I'd expect designing around not needing invalidations

You're right, this is a valid discussion to have. I imagine this would only work if the URLs included something like the build number, and the more generic URLs were replaced with redirects. If I'm not missing something, this would revert some of the SEO & URL work from https://github.com/rust-lang/docs.rs/issues/1438 (introducing /latest/ URLs). And then people would start linking to specific doc builds on their sites, as they did before we had /latest/.

(I think we've had this conversation elsewhere before).

you're probably right :)

I wanted to bring it up here as a point for when we migrate the infra anyway.

Mark-Simulacrum commented 6 months ago

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path, it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent what version you get across all pages if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.
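
As a sketch of that split, something like the following could decide the Cache-Control header per path; the path patterns and concrete values are assumptions for illustration, not what docs.rs currently sends:

```rust
/// Pick a Cache-Control value depending on the kind of path.
/// (Hypothetical path patterns and header values.)
fn cache_control(path: &str) -> &'static str {
    if path.starts_with("/-/static/") || path.contains(".hash-") {
        // Content-hashed assets can be cached essentially forever.
        "public, max-age=31536000, immutable"
    } else if path.contains("/latest/") {
        // Short TTL plus stale-while-revalidate for /latest/ HTML.
        "public, max-age=300, stale-while-revalidate=60"
    } else {
        // Everything else gets the short TTL without the revalidation hint.
        "public, max-age=300"
    }
}

fn main() {
    for path in ["/tokio/latest/tokio/", "/-/static/rustdoc-20xx.css"] {
        println!("{path} -> {}", cache_control(path));
    }
}
```

With stale-while-revalidate, the CDN can keep serving the slightly stale /latest/ page while it refetches in the background, which is exactly the small inconsistency window described above.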

Yes, especially anything out of S3 can definitely be replicated into multiple regions pretty easily if we need it to be. This just causes issues while we still need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?).

syphar commented 6 months ago

Note that (IMO) if we can get the cache keys set up right, i.e. everything except HTML is always at a by-hash file path, it seems to me that /latest/ can just be served with a short TTL (5 minutes), perhaps with stale-while-revalidate. That means there's a small period where it's not necessarily consistent what version you get across all pages if some are cached locally and some aren't (and likewise for the CDN), but I don't see any real problem with that. Users mostly won't even notice.

Jep, everything except HTML should already have hashed filenames, with some small exceptions. For HTML I (personally) would prefer a longer caching duration; 5 minutes outdated is probably fine, but I'm not sure how much that would reduce user happiness for some crates. I'll probably try to get better data at some point on what the cache coverage for certain crates looks like and see in more detail what the impact on users would be. And it might also be the case that it's just me who needs these kinds of response times for docs :)

Yes, especially anything out of S3 can definitely be replicated into multiple regions pretty easily if we need it to be. This just causes issues while we still need invalidations, since you're racing against replication, which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?).

That's good to know, thanks!