smallstep / certificates

🛡️ A private certificate authority (X.509 & SSH) & ACME server for secure automated certificate management, so you can use TLS everywhere & SSO for SSH.
https://smallstep.com/certificates
Apache License 2.0

Rate limit ACME requests #601

Open tashian opened 3 years ago

tashian commented 3 years ago

Sometimes ACME clients can misbehave and it's pretty easy to DoS step-ca in that case.

@MCWertGaming discovered an interaction between Caddy and step-ca that causes a flood of ACME requests, possibly triggered by the CA being unable to do a DNS lookup of the requested domain. See smallstep/certificates#598 for step-ca context and logs, and see also caddyserver/caddy#4186 for details of the Caddy side of things (potential issue with Caddy's ACME client).

Note: Rate limiting in ACME needs to return a rate limit error as defined in the RFC.

MCWertGaming commented 3 years ago

As I said earlier, I'd suggest limiting each domain to something like 5 requests per hour. Unfinished requests should probably also be cleaned up after some time: if one is not finished after 5 minutes, the client has probably already tried to order a new one anyway.

MCWertGaming commented 3 years ago

I provided more information here: https://github.com/caddyserver/caddy/issues/4186#issuecomment-854134793.

TheSecMaven commented 3 years ago

We run into this issue ourselves when a client is not being a good actor. Rate limiting is part of the RFC: https://datatracker.ietf.org/doc/html/rfc8555#section-6.6

dopey commented 3 years ago

Hey, we had the chance to discuss this issue in depth this morning. I'll do my best to transcribe our thoughts.

Our main question during the discussion was: what are we actually trying to prevent?

The "fixes" would be different depending on which of these scenarios we're trying to prevent.

A level deeper - we're wary of rate limiting because we don't want to introduce unintended consequences. For example, suppose we rate limit based on client IP. If the CA is being used in RA mode, then we may expect a single client to be making requests on behalf of many domains. Suppose instead we rate limit based on domain name. Then I could DoS you and keep you from getting a cert if your CA were public and I knew your domain names. We could list scenarios for a while (and we did this morning), but our conclusion was that we didn't want to start fixing something without a clearer understanding of the underlying problem (is the server running out of memory, disk, open sockets, etc.). And, more generally, we believe that a client making too many requests is a client-side bug and should be fixed on that side.

We'll leave the issue open because we want to collect more info about use cases and actual server issues (OOM, segfault, etc), but we're not planning any work here yet.

MCWertGaming commented 3 years ago

For me the problem was that the open requests made the internal database grow (for me 200 MB was the limit; at something like 206 MB the database was broken, as described before). Once the DB reaches that size, step-ca starts to use 100% CPU while doing nothing. The website still works (for example ca.domain.com/health), but ACME requests are not handled anymore. After a restart of step-ca, either the DB still works and step-ca runs normally afterwards, or badger complains about some unsafe config option that I would need, without giving a reason. The more precise explanation is in the discussion, if it would help you in any way.

Thanks for getting deeper into this!

MCWertGaming commented 2 years ago

Hey, any update on this? My CA broke again due to too many pending ACME requests that couldn't succeed because the DNS entry was wrong or missing.

MCWertGaming commented 2 years ago

Caddy has already fixed the bug that caused it to flood the step-ca server, but it's still a problem when a Caddy server keeps creating requests over several days.

Skinner927 commented 2 years ago

I've encountered the same "friendly" DoS attack where a DNS entry was incorrectly changed, step-ca could no longer resolve the domain, and Traefik ACME spammed step-ca until the drives on the step-ca host hit 100% and essentially bricked the host.

Rate limiting friendlies and/or some type of automatic logrotate or log pruning to prevent something like this would be welcomed.

dopey commented 2 years ago

@MCWertGaming @Skinner927 can you be more specific about what exactly failed? Did the DB of step-ca fill up? Was it the logs of step-ca that filled up?

dopey commented 1 year ago

@MCWertGaming I see that you provided more info on what exactly was failing in your case - the DB filling up. The problem, for us, is that it's not clear what the DB is filling up with. Is the client creating a new account for each new attempt? Is it the same ACME order, or a different ACME order for each new attempt? If it's the same ACME order, is the issue that the error log on the order continues to grow until it is unusable? Without knowing the specifics, it's hard for us to prioritize a fix. And, because different clients behave differently, fixing the issue for you may have no effect on another client.

MCWertGaming commented 1 year ago

I'm sorry for the late answer, but replicating this issue takes a week on my setup (since the Caddy webserver fixed their ACME timeout).

Basically what happens is that the client (in my case Caddy) tries to create ACME challenges with a single account. It opens challenges for a domain that has no DNS entry in my local DNS server. The result is that step-ca tries to do the challenge, but fails to reach the server through the domain. The problem is that step-ca does not remove the old entry from its badger DB when the client tries to make a new challenge after a few minutes.

My experience was that step-ca broke after the DB reached 200 MB. It seems like this issue was fixed? But I guess that the DB now keeps growing until the system storage is full. You might need to investigate this a bit further.

Basically what you would have to do:

I hope this will work for you @dopey