virtualmin / virtualmin-gpl

Virtualmin web hosting control panel for Webmin
https://www.virtualmin.com
GNU General Public License v3.0

Re-signing of xxxx.com failed : Re-signing failed : dnssec-signzone: warning: dns_dnssec_keylistfromrdataset: error reading ./Kxxxx.com.+013+25861.private: file not found #287

Open abclution opened 3 years ago

abclution commented 3 years ago

Considering zone xxxx.com
Key count 2
Zone key in /var/lib/bind/Kxxxx.com.+005+29715.private
Age in days 7.24450231481481
Re-signing of xxxx.com failed : Re-signing failed : dnssec-signzone: warning: dns_dnssec_keylistfromrdataset: error reading ./Kxxxx.com.+013+25861.private: file not found
dnssec-signzone: fatal: No self-signed KSK DNSKEY found. Supply an active key with the KSK flag set, or use '-P'.

Occasionally the cron re-sign fails, and running the resign.pl script with --debug gives a detailed error like the one above.

I'm not sure what causes this, but opening the domain's BIND/DNS records file, re-saving it in the Virtualmin control panel, and re-running resign.pl usually fixes it. I don't know how to reproduce it. Am I the only one seeing this?

It usually happens to a single domain out of the bunch. And yes, re-signing is set to run from cron.
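
For anyone hitting the same thing, here is a minimal sketch for pulling the missing key file names out of the dnssec-signzone output, so they can be compared against what is actually in the key directory. It only parses the "error reading ...: file not found" message format shown above. The resign.pl path and the key directory in the commented usage are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Print the base names of key files that dnssec-signzone reported as
# "file not found". Reads the tool's output on stdin.
missing_key_files() {
    # First sed keeps only the path from the error message,
    # second sed strips any leading directory such as "./".
    sed -n 's/.*error reading \([^ ]*\.private\): file not found.*/\1/p' |
        sed 's|^.*/||'
}

# Hypothetical usage (script path and KEY_DIR are assumptions):
# KEY_DIR=/var/lib/bind
# /path/to/resign.pl --debug 2>&1 | missing_key_files | while read -r f; do
#     [ -e "$KEY_DIR/$f" ] || echo "missing: $KEY_DIR/$f"
# done
```

In the output above, the zone key in use is +005+29715 while the missing file is +013+25861, so the zone seems to still reference a key whose private file is gone.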

chris001 commented 3 years ago

Yes, I see re-sign issues too. However, my setup is unconventional: I started using Let's Encrypt certbot for free TLS certs before Virtualmin added Let's Encrypt support, so the Virtualmin code that auto-renews certs (and probably then re-signs DNSSEC for the domains protected by those certs) doesn't appear to run.

jcameron commented 3 years ago

The source of the SSL certs shouldn't cause this ... unless somehow you're using Let's Encrypt HTTPS certs for DNSSEC?

abclution commented 3 years ago

@jcameron I'm not, I don't think. What I mean is that I'm using a pretty much stock Virtualmin setup in these regards.

While I do have DNSSEC enabled and re-signing works most of the time (see above), I keep randomly running into weird issues like this one. That has discouraged me from fully enabling DNSSEC on my hosted domains (it's early here and I forget the name of the records that need to be sent to the registrar), because once DNSSEC is fully enabled, misconfigurations or failures like the above can cause security warnings and site access errors.

I have a suspicion it has to do with files in /var/lib/bind (on Debian) getting created with the wrong owner and permissions, somehow, somewhere.

For example, the majority of the files in there have the right ownership (bind/bind), but there is a weird smattering of old files (which in fact should have been cleaned up automatically) that are owned by the domain user account instead of bind, strangely enough. No, I didn't change their permissions.

The domain I posted above, for instance, has the correct keyfiles that re-saving the DNS created yesterday, but also some old keyfiles with totally wrong permissions in /var/lib/bind (though not the specific keyfiles resign.pl was complaining about!).


There is a smattering of old keyfiles owned by the wrong user mixed in there for various domains, as well as leftover files from previously deleted sites. No idea how they got there, either.
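
A quick way to list the oddly owned keyfiles described above, as a minimal sketch: it assumes the Debian layout mentioned here (key directory /var/lib/bind, expected owner bind) and that key files start with "K". Both defaults can be overridden, and the function name is just for illustration.

```shell
#!/bin/sh
# List DNSSEC key files in a directory that are NOT owned by the expected user.
# $1: key directory (default /var/lib/bind, the Debian path from this thread)
# $2: expected owner, as a user name or numeric uid (default bind)
list_misowned_keys() {
    key_dir="${1:-/var/lib/bind}"
    owner="${2:-bind}"
    find "$key_dir" -maxdepth 1 -type f -name 'K*' ! -user "$owner" -print
}

# Example: list_misowned_keys /var/lib/bind bind
```

This only reports files; whether to chown or delete them is a separate decision, since some may be stale keys from deleted sites.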

chris001 commented 3 years ago

@jcameron True, the certs shouldn't break the DNSSEC re-signing. However, the two are interrelated to some extent. When the cert renewals fail, the domain names don't want to resolve, which causes slight issues on the DNSSEC side, secure email notifications of the situation probably fail to get sent, and many things break. When the DNSSEC signatures expire without being re-signed, the domains, cert renewals, mail and web all have issues, because they refer to a domain with expired DNSSEC signatures. Many validating public resolvers then refuse to answer for it (typically with SERVFAIL), so the domain looks non-existent to clients. These two renewal processes - for certs and for DNSSEC signatures - are critical for the security and accessibility of the domains hosted/managed by Virtualmin. They should be made as self-healing, resilient and persistent as possible.
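
To see how close a zone is to the expired-signature situation described above, the RRSIG expiration can be read from dig output. A minimal sketch, assuming dig's standard presentation format where the signature expiration is the ninth whitespace-separated field of an RRSIG line; the function name is made up for illustration.

```shell
#!/bin/sh
# Print the expiration timestamp (YYYYMMDDHHMMSS, UTC) of every RRSIG record
# found in dig-style output read from stdin.
rrsig_expirations() {
    # Fields: name ttl class RRSIG type-covered alg labels orig-ttl expiration ...
    awk '$4 == "RRSIG" { print $9 }'
}

# Example (queries the zone's SOA together with its signature):
# dig +dnssec +noall +answer xxxx.com SOA | rrsig_expirations
```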

iliajie commented 3 years ago

I assume the problem has nothing to do with DNSSEC itself, but rather with the DNS TLSA records not being synced?

@abclution If you run the following command, does it solve your problem?

virtualmin modify-dns --domain virtual-server.name --sync-tlsa

chris001 commented 3 years ago

@iliarostovtsev Part of the reason the cert renewals are failing is that certbot temporarily needs to be the service bound to and listening on ports 80/443 (http/https). That is how it proves to the Let's Encrypt Certificate Authority that it is requesting the cert from the IP address listed in the domain's DNS records (in our case, DNSSEC-secured records), so that it can receive the Domain-Validated cert. So we stop the nginx service while certbot does its renewals, which seem to take between 5 and 20 seconds per domain. With about 20 certs on this particular Virtualmin server, that comes to somewhere between 100 and 400 seconds. However, systemd restarts nginx every minute (60 seconds), like a watchdog timer, because it assumes nginx has crashed and wants it always up, as it's a critical service that must be available to both web users and web bots. So nginx starts up and binds to ports 80/443, which makes certbot fail the next cert renewal, as well as the rest of the renewals in its list! The script therefore has to detect this, stop the nginx service yet again, and retry the certbot renewal process until all the expired certs have been renewed. It's a race condition between systemd watchdogging nginx and certbot trying to renew more certs than can be renewed in one minute.
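
The arithmetic behind that race, as a small sketch using only the rough figures quoted above (5 to 20 seconds per cert, about 20 certs, a 60-second restart interval). The function name is just for illustration.

```shell
#!/bin/sh
# Print the minimum and maximum total renewal time in seconds.
# $1: fastest per-cert time, $2: slowest per-cert time, $3: number of certs
renewal_window() {
    min_per_cert=$1
    max_per_cert=$2
    cert_count=$3
    echo "$((min_per_cert * cert_count)) $((max_per_cert * cert_count))"
}

renewal_window 5 20 20   # prints "100 400"
```

Even the best case (100 seconds) is longer than the 60-second restart interval, so at least one restart lands in the middle of every full renewal run.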

jcameron commented 3 years ago

If you're using Virtualmin, certbot never needs to be run as a server like that, and shouldn't be configured to do its own automatic renewals.

abclution commented 2 years ago

Pretty sure this issue was related to this sad story.

https://github.com/virtualmin/virtualmin-gpl/issues/336

Cron jobs not running reliably is a problem for DNSSEC. Let me see whether things are still broken now that I've fixed cron.

jcameron commented 2 years ago

Thanks, I'll take a look at that other bug.

abclution commented 2 years ago

> I assume the problem has nothing to do with DNSSEC itself but rather with DNS TLSA records are not being synced?
>
> @abclution If you run the following command, does it solve your problem?
>
> virtualmin modify-dns --domain virtual-server.name --sync-tlsa

So, a follow-up: the issue finally happened again and I had time to look into it. Fixing my cron did nothing for this problem, as it is not solved by running resign.pl or the other jobs that were set to run via cron.

In fact @iliajie was right: it has to do with the TLSA records, and yes, that command did fix the issue. So, what job isn't running / syncing automatically, such that I need to run it manually?
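
Until the cause is found, the same command can be applied to several domains in one go. A minimal sketch: it only wraps the modify-dns invocation quoted above, reads domain names from stdin, and lets the binary be swapped via a VIRTUALMIN variable for a dry run. The function name and the VIRTUALMIN variable are made up for illustration, and this is a workaround, not the fix for whatever job is not syncing.

```shell
#!/bin/sh
# Run "virtualmin modify-dns --domain NAME --sync-tlsa" for each domain name
# read from stdin (one per line, blank lines skipped).
# Set VIRTUALMIN=echo to print the commands instead of running them.
sync_tlsa_for_domains() {
    while read -r domain; do
        [ -n "$domain" ] || continue
        # </dev/null keeps the command from consuming the domain list on stdin
        "${VIRTUALMIN:-virtualmin}" modify-dns --domain "$domain" --sync-tlsa </dev/null ||
            echo "TLSA sync failed for $domain" >&2
    done
}

# Example with a hand-written list of domains:
# printf '%s\n' xxxx.com | sync_tlsa_for_domains
```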

jcameron commented 2 years ago

@abclution so did you again see an issue where an SSL cert changed, but the TLSA records weren't updated?

If so, how was the cert changed?