Key-value based rate limiting

letsencrypt / boulder

An ACME-based certificate authority, written in Go.

Mozilla Public License 2.0

5.06k stars 593 forks source link

Key-value based rate limiting #5545

Open jprenken opened 2 years ago

jprenken commented 2 years ago

MySQL-compatible databases are inefficient for some of Boulder's most demanding data storage needs:

~~OCSP responses are perfect for a simpler and more performant key-value data model.~~
Rate limit data (a broad category) is ephemeral, and requires no ACID guarantees.

In order for Boulder to support large and growing environments, it would be nice for it to support Redis as a storage backend for these purposes. For OCSP responses, there will need to be some ability to insert/update data in both MySQL and Redis in parallel (e.g. during a transition).

---- (project tracking added by @beautifulentropy) ----

[x] Design a key-value rate limits system
- 6947
[x] Implement a Redis backing store
- 7016
[x] Configure Redis Ring shards using a SRV record (possibly live-reload?)
- 7041
- 7042
[x] Tidy existing rate limit metrics to achieve parity with new key-value metrics. This will assist in making better comparisons between the two implementations. Metrics should have similar naming, labeling, measurement, points of observation.
- 7053
- 7054
- 7124
[x] (Phase 1) Implement NewAccount and NewOrder rate limits, as a pass-through.
- [x] NewAccount
- 7089
- [x] NewOrder/WFE
  - 7143
  - 7199
  - 7200
  - 7258
  - 7201
- [x] #7346
- [x] Provide a tool which converts override configuration files from the legacy format to the key-value format
- [x] #7508
- [ ] Add verbose user-facing berrors which indicate the specific limit violated, the period of time to wait, and provides links to the relevant documentation.
[ ] (Phase 2) Add a feature-flag which makes key-value rate limits authoritative but leave existing legacy rate limits in-place.
[ ] (Phase 2) Make key-value rate limits authoritative in staging, validate near parity of decisions
[ ] (Phase 2) Make key-value rate limits authoritative in production, validate near parity of decisions
[ ] (Phase 3) Remove existing rate limits.
- [ ] #7511

jcjones commented 2 years ago

I would recommend we split this ticket apart.

Rate limit data is very well suited for key-value stores, which often have Redis-compatible command sets (we probably shouldn't use Redis itself when there are memory-safe alternatives that speak Redis-protocol).

OCSP data can be represented reasonably well both as key-value and as tabular data. I think for performance reasons we should consider changes to how OCSP gets stored, too, but it's fundamentally different data, has a lot more tooling already around it that would have to change if we were to change its primary representation, and there's room for something Redis-like being used as a "cache" to unload the DBs without changing a primary source-of-truth. I would like thus to consider it separately.

aarongable commented 1 year ago

As JC said, we should split this. OCSP data in Redis is now fully supported. So let's repurpose this ticket for @beautifulentropy's investigation into creating a new lookaside rate limit service, and the database system that will back that service.

ringerc commented 11 months ago

I've just quote some time trying to concoct a query against the crt.sh public certificate transparency database that will expose a reasonable estimate of current Let's Encrypt quota usage for a given domain as a workaround for the unavailability of quota-info on the public Let's Encrypt APIs, so I'm delighted to see this ticket and PoC PR.

I realise this ticket and the PR do not explicitly call for public APIs to query quota assignments or quota consumption, and I'm not trying to drive-by scope-hijack this ticket into doing so. But I do suggest that it's worth considering what a future public API might require when designing and testing the new quota system so that it leaves room for the possibility as follow-on work.

Past posts about quota checking have tended to point people at [crt.sh]() - to use the public certificate transparency log and associated database maintained there to compute quota usage. But this method won't account for Let's Encrypt's FQDN-set renewal-detection logic, and it's tricky to get right, let alone in a performant way. Plus it's not ideal to shift load onto crt.sh for people trying to monitor LE quota. So it'd be ideal if this quota rework could serve as the foundation of future public quota-access APIs.

Relevant past threads on checking quota usage include:

and I recently started a crt.sh discussion thread where I posted a probably-horribly-wrong query against the crt.sh database for LE quota checking purposes.

May I also suggest as part of this that messages relating to quota issues report the quota limit along with a quota exceeded message? E.g. at https://github.com/letsencrypt/boulder/blob/7d66d67054616867121e822fdc8ae58b10c1d71a/ra/ra.go#L1442C3-L1442C3 the web-api front-end replies

        return berrors.RateLimitError(retryAfter, "too many certificates already issued for %q. Retry after %s", namesOutOfLimit[0], retryString)

which informs the caller that the relevant quota has been exceeded, and for which domain, but not what the numeric value of the quota for the domain is.

ringerc commented 11 months ago

Regarding rate-limit data being ephemeral and not requiring consistency guarantees - while it can be derived from the main SA db cert logs, it might still be expensive to compute from scratch on a service restart. So the ability to checkpoint it and re-compute it forwards from the last checkpoint is likely to be desirable.

I didn't see anywhere in the Boulder SA code that limits how far back in time the current LE storage engine code will look to find a past issue of the same FQDN-set when it's deciding whether a given cert order is a renewal or not. That seems to put the entire LE cert history in scope for the quota checker datastore because it has to be able to check if any given FQDN-set was ever issued before. Even if I missed some limit on this, or if LE defines a look-back time limit as part of the new quota system work, it'll presumably have to be at least the 90 day cert validity window plus some reasonable slush margin. That's a lot of data to scan and rebuild a unique-FQDN-set cache from if the ephemeral store is lost to a crash, restart etc. That could have a significant effect on backend load during reconstruction and/or time-to-availability.

Check-pointing the rate-limit state wouldn't have to be consistent or atomic; it could just err in favor of under-counting usage rather than over-counting so it never falsely reported quota to be exceeded. Recording the latest-issued cert after saving a checkpoint of the rate limit state would probably do; then it'd only have to scan-forward from there. If some certs issued are not counted because they got issued after the snapshot started and before the last-issued position was recorded in the DB, who cares? It won't be lots and any error in the computed quota will age out of the quota window for certs-per-domain and renewals quotas within 7 days anyway.

I don't understand how the store can be re-initialized with a starting state or efficiently "replayed" to recover past state based on the store in https://github.com/letsencrypt/boulder/blob/861161fc9f76f7b0cdb27c0b7e81d1572e4c5061/ratelimits/source.go .

Similarly, if quota state recording and lookup is going for non-atomic "close enough" measurement then erring on side of under-counting usage would make sense.

aarongable commented 11 months ago

Hi @ringerc! A few notes in no particular order:

We're not committing to anything I say here, this whole message is just the current state of our thinking, which will certainly continue to evolve.
We do plan to expose rate limit info directly to clients. We're considering two avenues for doing so:
- Including comprehensive rate limit headers in all responses. So not just a Retry-After header, but also a "how many tokens do you have left" header, and a "how long until you're back to max capacity" header, etc.
- Providing a "tell me about all of my current rate limit quotas" endpoint that clients could query to a json document detailing their current state
Don't think too hard about the way the current rate limits are implemented in the SA. The whole point of @beautifulentropy's PR linked above is that we plan to get rid of all of that -- the rate limit data will be stored entirely outside of the database. The comment about ephemerality is basically saying: if the rate limits storage system gets dropped, that's not the end of the world. We'll get some extra traffic for a bit until the limits start kicking in again, but we won't fall over and we won't run into any compliance problems. So it's fundamentally okay if the rate limits data (or some subset of it) just vanishes, and we don't intend to go to any effort to reconstruct it from the database. We're considering various different backing stores, based on mix of criteria like easy of deployability, reliability, speed, and more.

ringerc commented 11 months ago

@aarongable That makes a lot of sense.

The only issue I see with ephemerality is that LE quotas currently exclude "renewals" for one domain, where the FQDN set checked for renewals seems to be back in time forever. If that's lost, all incoming cert orders count against the new-domains quota not the renewals-quota, which could easily cause an org to exceed quota for what would normally be a way-lower-than-limits load.

There's an effectively unlimited number of domains that could be renewed within the quota window so no amount of padding of the new-domains quota to compensate for "forgotten" renewals would guarantee correctness. A domain-account could've issued 50 certs 180 days ago, then renewed those 50 and issued another 50 certs 90 days ago, and now want to renew all 100 + issue 50 more; it'd usually be able to rely on being able to do this for an indefinite number of 90 day cert validity cycles so there's no sensible quota to apply if knowledge of which FQDN-sets are renewals is suddenly lost.

That problem won't go away after one quota cycle, so enforcement of that quota can't just be turned off until one 7 day cycle is past to solve the issue.

I don't see an obvious solution for that with the current quota definitions unless the unique-FQDN-sets for past certs issued can be safely stored or reconstructed on loss.

Redefining the quotas themselves w.r.t to renewals is one option. For example, change the renewal quota rules to say that only renewals of certs for fqdn-sets that were last issued or renewed in the past (say) 100 days are guaranteed to be exempted from the new-certs quota. Most people will only renew certs to maintain continuous rolling validity so this should be the immense majority of renewals. Then if unique-fqdn-set info for renewal detection is lost, increase the new-domains quota to 3x the usual value for the first 90 day cert validity window after loss of renewal history. That would ensure that even if every renewal is miscounted as a new domain no previously-valid request would be rejected, and it wouldn't require any reconstruction of unique FQDN set info from the main SA DB. Or it could be reconstructed lazily after recovery, and the normal limits re-imposed once the FQDN-set info was reconstructed.

beautifulentropy commented 11 months ago

@ringerc Thanks for the feedback; I believe you're correct on these points. Our plan to account for this is to continue using our MariaDB database as the source of truth when determining whether a newOrder request is a renewal.

ringerc commented 11 months ago

@beautifulentropy Makes sense. And the quota layer can be used as a write-through cache over the SA MariaDB FQDN set tracking so it can still offload a lot of the cost of the FQDN-set checks. If a domain-set is added to the FQDN-set once it's never removed, so the cache doesn't need any complicated invalidation logic and is easy to pre-warm on restart.

So the idea here is to add some kind of ephemeral no-consistency-guarantees layer for storing and looking the quotas that enforce limits on new orders, unique FQDN-sets per TLD+1, and most others. The FQDN-set uniqueness checks used to detect renewals will continue to use the current SA code, either directly as now, or possibly via write-through caching via the quota store. Client visibility into quota status will be exposed via a rich set of quota status headers on all responses and/or via dedicated quota-lookup API endpoints.

Reasonable summary of the intent as it stands?

letsencrypt / boulder

Key-value based rate limiting #5545

6947

7016

7041

7042

7053

7054

7124

7089

7143

7199

7200

7258

7201