smallstep / certificates

🛡️ A private certificate authority (X.509 & SSH) & ACME server for secure automated certificate management, so you can use TLS everywhere & SSO for SSH.
https://smallstep.com/certificates
Apache License 2.0

[Bug]: Step-CA fails on renewal #1855

Open MauriceMossIT opened 1 month ago

MauriceMossIT commented 1 month ago

Steps to Reproduce

Issue

I recently set up a new Step-CA server to work with a load balancer for ACME certificates using the HTTP-01 challenge. First-time enrollments work well. However, any later attempt to enroll or renew a certificate fails. I had this issue last year but was never able to resolve it. I have better information now, so I wanted to revisit it.

The error in the step-ca logs on the server is: response="{"type":"urn:ietf:params:acme:error:badNonce","detail":"Unacceptable anti-replay nonce"}"

The load balancer originally worked for renewals when it was on an older code version. The newer version fails. The only change on the load balancer was an update to a newer version of the acme.sh script, which I think might play a role.

Testing

Here is a capture of an enrollment, followed by another forced enrollment. [image]

In this pcap, you can see the first request effectively complete at packet 6; however, the server doesn't close the connection and sends a keep-alive every 15 seconds. Packet 11 shows the start of a new enrollment. There's no TCP handshake here because the kept-alive session is reused first. This fails, and 30 seconds later the server sends a FIN/ACK. This is where I think the acme.sh version difference could matter. Note that between the FIN/ACK sent by the server and the next SYN, there is a gap of ~2.8 seconds during which the load balancer sends a RST/ACK. After the RST and SYN, the connection continues and the certificate is successfully enrolled again (using --force).

Here is a capture from the load balancer on a newer code version with the updated acme.sh version. [image]

The initial enrollment of a certificate with a newly started step-ca instance works just the same as the previous one. The longer duration between enrollments was simply a difference of when I triggered a new enrollment in this capture versus the previous one.

Looking at packet 21, we see the 2nd attempt at enrollment. Just like in the previous capture, 30 seconds later the server sends a FIN/ACK. However, this time the server sends a SYN ~1.4 seconds after the FIN/ACK. This is faster than in the older-version tests, which I assume could be related to the updated acme.sh version. The load balancer sends a RST/ACK 2 seconds after the FIN (the same amount of time in both captures). Then a new FIN/ACK is sent by the server and the enrollment fails.

I did some additional testing from the load balancer, using a configuration that made the load balancer send a FIN/ACK immediately after the server sent its FIN/ACK for the 2nd enrollment attempt. This occurs before the 2nd SYN and prevents the RST/ACK from ever being sent. You can see this in packet 18. This attempt also fails after the first enrollment. [image]

The last bit of testing I did was to use tcpkill on the server side to send a RST every time the connection completes. Doing this, I was able to consistently enroll certificates without issue.

Based on the captures and testing I've done, it seems like the server requires a RST to occur before a new certificate can be enrolled. Otherwise, the server needs to be restarted each time. It's possible that immediately closing the connection with a FIN/ACK after each certificate could also resolve this; however, I was only able to test with RSTs and not with FIN/ACKs.
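To illustrate what "closing the connection after each exchange" looks like in a Go HTTP server in general, here is a minimal sketch using the standard library's `SetKeepAlivesEnabled(false)`. It uses a throwaway `httptest` server with hypothetical helper names, not step-ca itself, and it is untested against this setup, so it only demonstrates the mechanism.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newNoKeepAliveServer (hypothetical helper) starts a local test server that
// disables HTTP keep-alives, so each connection is closed after one response.
func newNoKeepAliveServer() *httptest.Server {
	ts := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	ts.Config.SetKeepAlivesEnabled(false)
	ts.Start()
	return ts
}

// closedAfterResponse sends one GET and reports whether the server signalled
// "Connection: close" on the response.
func closedAfterResponse(url string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return false, err
	}
	return resp.Close, nil
}

func main() {
	ts := newNoKeepAliveServer()
	defer ts.Close()
	closed, err := closedAfterResponse(ts.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("server closed connection after response:", closed)
}
```

With keep-alives disabled, the client has to open a new TCP connection (a fresh SYN) for every request, which is the behavior the tcpkill test forced from the outside.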

Any and all tests can be replicated, and I am more than happy to discuss/demonstrate this behavior over a call as well.

Your Environment

Expected Behavior

Successful certificate renewal.

Actual Behavior

Renewal fails.

Additional Context

Is there any reason the step-ca server keeps the connection alive rather than closing it?

Is there a possible solution to this issue on the step-ca side?

This certainly does not appear to be an issue with the load balancer, and the load balancer works fine with the public Let's Encrypt servers.

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

ottobaer commented 1 month ago

Interestingly I get the same error with acme.sh in the bug I reported #1856

pandabaer.lan.ursidae.space:Challenge error: {"type":"urn:ietf:params:acme:error:badNonce","detail":"Unacceptable anti-replay nonce"}

FYI: the problem for me was that I had set a DNS server that no longer exists on the step-ca command line with '--resolver='

So I guess that is not related.

hslatman commented 1 month ago

Hey @MauriceMossIT, what load balancer (incl. version) are you using? I believe it's A10? What operating system is it running/based on? And what are the versions of acme.sh that (don't) work? Given that you changed the version of acme.sh and it then stopped working, it may be possible to narrow down to what changed the behavior.

Also, are connections to the CA going through the load balancer, or are they performed directly?

As far as I can tell, the CA uses Go's default keep-alive for connections, which should be 15 seconds. This indeed seems to correspond to the packets you observed in the capture. I think in general it's a good thing that the connection is kept alive, as long as it's actively being used by the ACME client. If I'm correct, we also set a timeout of 15 seconds by default. For a different issue we have a draft PR that alters these timeouts: https://github.com/smallstep/certificates/pull/1643; maybe a shorter timeout changes the behavior?

MauriceMossIT commented 1 month ago

Hello @hslatman,

It is an A10 running any version after 5.2.1-P6. 5.2.1 versions prior to that work, as P6 is when acme.sh was updated to version 3.0.1. Previously, version 2.8.6 was used.

The load balancer itself performs the ACME connection. The load balancer initiates the ACME process with the step-ca server; the server then checks DNS, which is set to itself and uses BIND; the challenge is then sent to a virtual server on the load balancer that is configured to reply to the challenge; finally, the server sends the certificate to the load balancer.

Once the certificate has been acquired, the load balancer should have no need to keep the connection alive, as no further communication should be needed until a renewal occurs.

Let me know if there is anything specific in testing that would help. I can modify most things in the setup.