LE SSL certs not renewing - due malformed json error and querry re all certs being archived for a single sites' verify task.

EdNett commented 1 month ago

Hello, I have found an error in the let's encrypt certificate renewal (regeneration) process which is causing certificate renewal to fail, and also I have a query on somthing else happening with LE on the site verify task, which might be affecting all sites on the octopus instances' renewal.

No new certs are being placed in the /data/disk/o2/tools/le/certs/domain.com/ directory, even though a site verify task (and I have tried disabling ssl, verifying, and then re-enabling again and re-verifying) shows that new certs are being generated. The ownerships and permissions of the le/certs folders are all correct as far as I can tell. I can't seem to find a dedicated le log.

The relevant part of the site "domain.com" verify task is here:

https://gist.github.com/EdNett/0ed06e457b98c25f44facaf06ff4484a

Please see line 151 for the error ("type": "urn:ietf:params:acme:error:malformed", "detail": "JWS verification error", "status": 400)

and on line 157 I see that a site verify task for one domain is causing 13 different site certificates to be moved to the archive folder - can this be intended behavior? Wouldn't this cause all other sites' ssl cert renewal to fail ?

boa info here: https://gist.github.com/EdNett/e0eda12ecd6e73286cced67e89349d29

I can't seem to find where the script for the site verify task is, or I could probably work on this myself.

Best,

Ed

omega8cc commented 1 month ago

Is there any particular reason why you have _SKYNET_MODE=OFF? This setting is intended only for legacy systems that are no longer supported. Disabling it on supported systems can prevent the system's auto-healing functionality from working properly and may lead to unexpected issues.

As for the issue itself, we haven't been able to reproduce it in DEV or PRO environments, so it might be specific to LTS. Further debugging would be required, but with the upcoming release approaching, it’s unlikely this will be addressed soon.

EdNett commented 1 month ago

Hello,

I don't really understand _SKYNET_MODE - in general I would rather test all changes on our super-cheap test vps first before adn then do a regular up-lts rather than having any changes "pushed" to a production vm or server. Is that what SKYNET_MODE=ON does ? It pushes fixes and other global distro updates or hotfixes to all servers in BOA with _SKYNET_MODE=ON ? Does it email a notice that a fix or update has been made ? How do I know when it has been used? Etc. these kinds of concerns, that's all. If that's the only way to get a hotfix for this, then I have already re-ran up-lts with SKYNET_MODE=ON .

Should I replace all the 13 ssl certs which were moved to the /archive/ folder (re above) back to each domains' cert folder? WOuld that be a good idea, for example?

Thanks Ed.

omega8cc commented 1 month ago

_SKYNET_MODE=ON is responsible for maintaining auto-healing functionality and ensuring that it remains in good condition. It helps keep BOA tools operational, which includes also critical checks on components like cURL, Python, and Lshell, ensuring they function properly without issues.

We always have it enabled on all production servers, which should give you confidence that it’s safe for production environments.

It doesn't send any notifications, but it logs its actions in /var/xdrago/log

BOA is designed to self-maintain and even self-upgrade, as long as the optional cron entries are in place. It’s built with the expectation that you are using a supported system and not making changes beyond the hosted Aegir sites. When used this way, BOA functions flawlessly.

However, if you take actions outside of the standard BOA upgrade processes (using Barracuda and Octopus), such as manually installing packages, changing default settings, or disabling _SKYNET_MODE, you are assuming full responsibility for any issues that arise. BOA may become confused by manual interventions and behave in unpredictable ways that we cannot control.

In short, if you allow BOA to operate in its intended zero-touch manner, it will run without problems for years. But if you disable _SKYNET_MODE or make manual changes, you are on your own, and we may not be able to assist you.

EdNett commented 1 month ago

Ok Thanks for the clarification. I didn't understand that at all.

EdNett commented 1 month ago

This probably is't the reason for these ssl cert renewal failuers, but it does have something in common with issue #1817, I believe:

for each of our domains we have only IPv4 DNS entries for domain.com and www.domain.com, but we also have both IPv4 and IPv6 entries on our mail server - a completely differnet server of course and not boa at all) for mail.domain.com and lets encrypt certificates are generated and exist (using ISPConfig3 for this generation of LE certificates for the subdomain mail.domain.com - which has its own vhost entry - served by apache 2.4 - thus it is setup as an independant domain, not as a subdomain).

Is there any chance that this could be causing the boa domain.com + www.domain.com LE cert renewal ? This has never been a problem before, and I have been doing this for many years successfully - with the ISPConfig3 generated LE cert on mail.domain.com and BOA generated LE cert for domain.com and www.domain.com.

omega8cc commented 1 month ago

It could be still related because the SKYNET is taking care of LE tools health check and fixes if needed too.

If you have different server using subdomain not included in your Aegir settings for the main domain and other aliases it shouldn’t matter, but honestly we never tested such scenarios, so perhaps LE servers are doing some checks based on their dns logic which cause problems, but we don’t see anything on the BOA side to cause such problems.

It also no longer matters if you have IPv6 entries on your DNS since BOA instructs LE to only care about IPv4 for now.

_{Sent with GitHawk}

EdNett commented 1 month ago

I have checked everything I can think of and the ONLY thing that is different on this non-le-renewing-vm and an older vps which appears to still be renewing le certs is:

non-working :

curl --version curl 8.9.1 (x86_64-pc-linux-gnu) libcurl/8.9.1 OpenSSL/3.0.15 zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.4.57 Release-Date: 2024-07-31 Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets server1:~#

working :

curl --version curl 8.9.1 (x86_64-pc-linux-gnu) libcurl/8.9.1 OpenSSL/3.0.15 zlib/1.2.11 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.0 libpsl/0.21.0 nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.13 Release-Date: 2024-07-31 Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp Features: alt-svc AsynchDNS brotli HSTS HTTP2 HTTPS-proxy IDN IPv6 Largefile libz NTLM PSL SSL threadsafe TLS-SRP UnixSockets zstd server1:~#

note the minor difference in OpenLDAP version and the missing zstd in the newer non-working vm. This is about all I can find. Best, Ed

EdNett commented 1 month ago

Would upgrading CURL to 8.10.0 or 8.10.1 help? I see that ther are a huge number of bugfixes in 8.10.0

8.9.1 has a vulnerability, too:

curl version 8.9.1 was released on July 31 2024

It has the following 1 published security problem.

Flaw | From version | To and including -- | -- | -- OCSP stapling bypass with GnuTLS | 7.41.0 | 8.9.1

Also, what about adding HTTP/1.1 to the CURL build - that solved similar problems in the past.

Still not working, of course, now 3rd domain down (no https://). I get the same error when trying to use the dehydrated script (which staes version is 0.7.2 - but latest is 0.7.1!) to force a renewal from the command line.

By the way is there suppossed to be a copy of the /var/xdrago/conf/dehydrated script also in /data/disk/o2/tools/le/ ? (there was NOT on this server - I copied it there to try to force renewal manually from the command line. The problem MUST either be in the CURL build or the dehydrated script, in my opinion, and I just can't figure it out.

EdNett commented 1 month ago

To further test and debug this, I deleted a site for which le cert renewal was not woring, and whose cert had already expired, deleting all references to it everywhere - even in the database. I then reinstalled it on the same platform, and after an hour re-enabled ssl, and the same error happened, but two additional errors occurred which might be important - the creation of symlinks failed in two spots - gist of relevant part of verify task is here: https://gist.github.com/EdNett/586a7320d2a1ba5def096cb9e3970ee4

EdNett commented 1 month ago

Does this problem only exist on Chimera?

Is Chimera-to-Daedalus upgrade supported and working? (It wans't working a month ago at all).

Best, Ed

reswild commented 1 month ago

I'm running BOA 5.4.0-lts on Chimera, and I haven't noticed any issues with renewing certs.

From your error message, it sounds like you're not using the correct account key with your renewal requests.

EdNett commented 1 month ago

Thanks for that info - I'm closing this since I destroyed that server as it had been compromised.

omega8cc / boa

LE SSL certs not renewing - due malformed json error and querry re all certs being archived for a single sites' verify task. #1818