Open abclution opened 3 years ago
Yes, there are re-sign issues. However, the setup is unconventional: we started using Let's Encrypt certbot for free TLS certs before Virtualmin added support for Let's Encrypt free TLS certs, so the Virtualmin code that auto-renews certs, and probably subsequently auto re-signs DNSSEC for domains protected by those certs, doesn't appear to run.
The source of the SSL certs shouldn't cause this ... unless somehow you're using Let's Encrypt HTTPS certs for DNSSEC?
@jcameron I'm not, I don't think. What I mean is that I'm using a pretty much stock Virtualmin setup in these regards.
While I do have DNSSEC enabled and re-signing works most of the time, see above. I keep running into weird issues like the above at random, and I have discouraged fully enabling DNSSEC on my hosted domains (it's early here and I forget the name of the records that need to be sent to the registrar), because once DNSSEC is fully enabled, misconfigurations or things like the above can cause security warnings and site access errors.
I have a suspicion it has to do with files in (DEBIAN) /var/lib/bind getting created with the wrong owner and permissions, somehow, somewhere.
For example, the majority of the files in there are set to the right ownership (bind/bind), but there is a weird smattering of files that are old (and in fact should have been cleaned up automatically) and are owned by the domain user account instead of bind, strangely enough. No, I didn't change their permissions.
For example, the domain that I posted above has the correct keyfiles that resaving the DNS created yesterday, but also, randomly, some old keyfiles with totally wrong permissions in /var/lib/bind (but not the specific keyfiles resign.pl was complaining about!)
There is a smattering of old keyfiles owned by the wrong user mixed in there for various domains. No idea how they got there, either. As well as leftover files from previously deleted sites.
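A read-only check along these lines may help to list the stray files. It is only a sketch: the directory and the bind user come from the Debian setup described above, and the function name list_foreign_files is made up for illustration.

```shell
# Hypothetical helper: list regular files under DIR that are not owned by OWNER.
# Defaults assume the Debian layout mentioned above (/var/lib/bind, user "bind").
list_foreign_files() {
    dir=${1:-/var/lib/bind}
    owner=${2:-bind}
    # Read-only: prints an "ls -l" line per mismatched file, changes nothing.
    find "$dir" -type f ! -user "$owner" -exec ls -l {} +
}

# Example: list_foreign_files /var/lib/bind bind
```

An empty result would mean every file is owned by the expected user; anything listed is a candidate for a closer look before changing ownership or deleting.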
@jcameron True, the certs shouldn't break the DNSSEC re-signing. However, they're interrelated to some extent. When the cert renewals fail, the domain names don't want to resolve, which causes slight issues on the DNSSEC side; secure email notifications of this situation probably fail to get sent, and many things break. When the DNSSEC signatures expire without getting re-signed, the domains, cert renewals, mail, and web have issues because they're referring to an insecure domain with expired DNSSEC signatures, so many secure public resolvers return NXDOMAIN, i.e. non-existent domain. These two renewal processes, for certs and DNSSEC signatures, are critical for the security and accessibility of the domains hosted/managed by Virtualmin. They should be updated to be as self-healing, resilient, and persistent as possible.
I assume the problem has nothing to do with DNSSEC itself but rather with DNS TLSA records not being synced?
@abclution If you run the following command, does it solve your problem?
virtualmin modify-dns --domain virtual-server.name --sync-tlsa
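Before or after running that command, it may be useful to check whether the published TLSA record still matches the certificate on disk. The following is a rough sketch, not Virtualmin's own logic: it assumes the record uses selector 0 (full certificate) and matching type 1 (SHA-256), and the helper names, certificate path, and port are placeholders.

```shell
# Hypothetical helpers for comparing a certificate with a published TLSA record.
# Assumption: the record uses selector 0 (full certificate) and matching type 1
# (SHA-256); other parameter combinations need a different digest.

# Print the lowercase SHA-256 hex digest of a PEM certificate's DER encoding.
cert_sha256_hex() {
    openssl x509 -in "$1" -outform DER | openssl dgst -sha256 | awk '{ print $NF }'
}

# Reduce one line of "dig +short TLSA" output ("3 0 1 ABCD... EF01...") to
# the bare lowercase hex digest, so it can be compared with cert_sha256_hex.
tlsa_hex_from_dig_line() {
    printf '%s\n' "$1" | awk '{ out = ""; for (i = 4; i <= NF; i++) out = out $i; print tolower(out) }'
}

# Example (the certificate path and domain are placeholders):
#   want=$(cert_sha256_hex /path/to/ssl.cert)
#   dig +short TLSA _443._tcp.virtual-server.name | while read -r line; do
#       [ "$(tlsa_hex_from_dig_line "$line")" = "$want" ] && echo "TLSA matches cert"
#   done
```

If no line matches, the record is either stale or published with different TLSA parameters than assumed here.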
@iliarostovtsev
Part of the reason the cert renewals are failing is that certbot needs to temporarily be the service binding and listening on ports 80/443 (http/https) to prove to the Let's Encrypt Certificate Authority that it's requesting the cert on the expected IP address listed in the DNS records for the domain (in our case, secure DNSSEC records) in order to receive this Domain-Validated cert. So we have the nginx service stop while certbot does its renewal process, which seems to take between 5 and 20 seconds per domain; times about 20 certs on this one particular Virtualmin server, that comes to somewhere between 100 and 400 seconds. However, systemd is restarting nginx every minute (60 seconds) like a watchdog timer, because it assumes nginx bombed and wants to restart it so that nginx is always up, as it's a critical service that must always be available to both web users and web bots. So nginx starts up and binds to ports 80/443, which causes certbot to fail the next cert renewal, as well as the rest of the cert renewals in its list! So the script has to detect this, stop the nginx service yet again, and retry the certbot renewal process until all the expired certs have successfully renewed. It's a race condition between systemd watchdogging nginx and certbot trying to renew more certs than can get renewed in one minute.
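The timing argument above can be checked with a little arithmetic. This is just a worked example using the figures quoted in the comment; the function name renewal_window is invented for illustration.

```shell
# Hypothetical helper: back-of-the-envelope timing for the race described above.
# Args: number of certs, min and max seconds per renewal, restart interval (s).
renewal_window() {
    certs=$1; min_s=$2; max_s=$3; restart_s=$4
    total_lo=$((certs * min_s))
    total_hi=$((certs * max_s))
    echo "total run: ${total_lo}-${total_hi}s"
    # Integer division: full restart intervals that elapse during the run.
    echo "restarts during run: $((total_lo / restart_s))-$((total_hi / restart_s))"
    # Renewals that fit into a single interval before nginx comes back.
    echo "renewals per interval: $((restart_s / max_s))-$((restart_s / min_s))"
}

# Example with the figures from the comment: renewal_window 20 5 20 60
```

With 20 certs at 5 to 20 seconds each and a 60-second restart interval, only 3 to 12 renewals fit into one interval, so the full batch cannot finish before nginx is started again.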
If you're using Virtualmin, certbot never needs to be run as a server like that, and shouldn't be configured to do its own automatic renewals.
Pretty sure this issue was related to this sad story.
https://github.com/virtualmin/virtualmin-gpl/issues/336
Cron jobs not running reliably is problematic for dnssec. Let me see if things are still broken since I fixed cron.
Thanks, I'll take a look at that other bug..
So, a follow-up: the issue finally happened again and I had time to look into it. Fixing my cron did nothing to fix this problem, as it is not solved by running resign.pl or the other jobs that were set to run via cron.
In fact @iliajie was right: it has to do with the TLSA records, and yes, that command did fix the issue. So, what job isn't running / syncing automatically that I need to run it manually?
@abclution so did you again see an issue where an SSL cert changed, but the TLSA records weren't updated?
If so, how was the cert changed?
Considering zone xxxx.com
Key count 2
Zone key in /var/lib/bind/Kxxxx.com.+005+29715.private
Age in days 7.24450231481481
Re-signing of xxxx.com failed : Re-signing failed : dnssec-signzone: warning: dns_dnssec_keylistfromrdataset: error reading ./Kxxxx.com.+013+25861.private: file not found
dnssec-signzone: fatal: No self-signed KSK DNSKEY found. Supply an active key with the KSK flag set, or use '-P'.
So occasionally I get a cron re-sign failure, and a detailed error like this when running the resign.pl --debug script.
Not sure what causes this, but usually opening the domain's BIND/DNS records file, resaving it in the Virtualmin control panel, and rerunning resign.pl fixes it. I don't really know how to make it happen, but am I the only one?
It usually happens to a single domain out of the bunch. And yes, re-sign is set to run by cron.
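When this error shows up, one quick way to narrow it down is to pull every key file name out of the debug output and see which ones actually exist in the key directory. The sketch below assumes the output format shown above and that the keys live in /var/lib/bind; the function name check_keys_from_log is hypothetical.

```shell
# Hypothetical helper: read resign.pl debug output on stdin, extract every
# K*.private file name it mentions, and report whether each exists in DIR.
check_keys_from_log() {
    dir=${1:-/var/lib/bind}
    # grep -o prints only the matching part; sort -u drops repeated names.
    grep -o 'K[^ ]*\.private' | sort -u | while read -r name; do
        name=$(basename "$name")
        if [ -e "$dir/$name" ]; then
            echo "present $name"
        else
            echo "missing $name"
        fi
    done
}

# Example: save the debug output to a file (here debug.log, a placeholder)
# and run: check_keys_from_log /var/lib/bind < debug.log
```

A key reported as missing here, while the zone still references it, would be consistent with the "file not found" warning; it does not by itself explain how the file went missing.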