DANE TLSA record misconfiguration (not updating TLSA hashes when new LetsEncrypt certs appear).

chris001 commented 4 years ago

Received a nice automated email today from Viktor Dukhovni (ietf-dane@dukhovni.org). He runs a tool that monitors domains which use TLSA records and sends a notification email when it detects a botched key rotation: the TLSA records no longer match the key/cert actually being served by the server, so DANE-validating clients are blocked from the service. More info: https://www.isi.edu/~hardaker/presentations/2019-06-DANE-hardaker-dukhovni.pdf

Basically, Virtualmin/Webmin wrongly assumes that it, and only it, is allowed to update/renew LetsEncrypt certs, and the TLSA hash generation depends on that assumption. The result is that outdated TLSA resource record hashes are left sitting in BIND9 DNS and served to the world, which blocks access to the various services running on the Webmin/Virtualmin server!

The TLSA hash generation (and insertion/update into BIND9 DNS) must be made independent of who renewed the cert. As soon as Webmin/Virtualmin detects that a new, valid cert is encrypting traffic on any of the server ports, it needs to update the TLSA DNS record to contain the correct hash of that new cert, so that BIND9 serves the correct TLSA record to the world, regardless of whether Webmin/Virtualmin or an external script invoked the cert creation/renewal.

See also similar issues: #115 and #108.

jcameron commented 4 years ago

Interesting, that's a case we hadn't considered - currently Virtualmin does indeed assume that it manages certs and thus knows when TLSA records need to be updated.

What if we provided a script that would re-sync TLSA records, which you could call after manual updates, or even set up a cron job to run?

chris001 commented 4 years ago

> What if we provided a script that would re-sync TLSA records, which you could call after manual updates, or even set up a cron job to run?

A script to update the TLSA resource records would be great. It'd need to:

Generate all of the new TLSA resource records and insert them into the relevant BIND9 DNS zone at least 2xTTL (the TTL of the TLSA resource records) before the old (LetsEncrypt) certs expire.
Make sure to trigger an automatic re-sign of the zone for DNSSEC handled by BIND9, if the BIND9 version is high enough and this setting is enabled.
Make sure to trigger an automatic update to the slave DNS servers; this depends on the zone serial number being incremented.
When the old (LetsEncrypt) certs expire, remove the old TLSA resource records, make sure to trigger an automatic re-sign of the zone for DNSSEC, and make sure to trigger an automatic update to the slave DNS servers.
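
Since the slave update in the list above hinges on the SOA serial going up, here is a minimal sketch of a serial bump. It assumes the zone uses the common YYYYMMDDnn serial convention; `next_serial` is a hypothetical helper for illustration and is not taken from Webmin's or BIND's code.

```python
from datetime import date


def next_serial(current: int, today: date) -> int:
    """Return the next SOA serial, assuming the YYYYMMDDnn convention."""
    # First serial of the day, e.g. 2020-01-02 -> 2020010200.
    base = int(today.strftime("%Y%m%d")) * 100
    # If the zone was already edited today (or the serial is somehow ahead
    # of today's date), just add one so the serial always increases.
    return current + 1 if current >= base else base
```

Whatever inserts or prunes a TLSA record would call this once per zone edit, before the zone is reloaded and re-signed.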

Reference: From slide 25, June 2019 ICANN65 presentation on DANE/TLSA :

Rolling Your TLS Keys

Use multiple TLSA records to publish current and future keys
- Publish TLSA records of keys well in advance of using new certificates
- Required by DNS caching (publish 2xTTL ahead)
Two pre-publishing models:
- EE Key + Next EE Key: (3 1 1 + 3 1 1)
- EE Key + TA Key: (3 1 1 + 2 1 1)
Deploy new chain, and publish new TLSA records:
_25._tcp.mx.example.com. IN TLSA 3 1 1 **_curr-pubkey-sha256_**
_25._tcp.mx.example.com. IN TLSA 3 1 1 **_next-pubkey-sha256_**
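
For anyone unsure what goes into a "3 1 1" record like the ones on the slide, here is a rough sketch of building the record text. The fields follow RFC 6698: usage 3 is the end-entity cert, selector 1 is the SubjectPublicKeyInfo, and matching type 1 is SHA-256. The helper names `tlsa_rdata` and `tlsa_record` are made up for illustration, and the caller is assumed to supply the right DER bytes already (full cert for selector 0, SubjectPublicKeyInfo for selector 1); extracting those from a cert file is not shown.

```python
import hashlib


def tlsa_rdata(usage: int, selector: int, matching_type: int, der: bytes) -> str:
    """Build TLSA rdata from DER bytes chosen by the caller per `selector`."""
    if matching_type == 0:      # exact match: the data itself, hex-encoded
        assoc = der.hex()
    elif matching_type == 1:    # SHA-256 of the selected data
        assoc = hashlib.sha256(der).hexdigest()
    elif matching_type == 2:    # SHA-512 of the selected data
        assoc = hashlib.sha512(der).hexdigest()
    else:
        raise ValueError(f"unknown matching type: {matching_type}")
    return f"{usage} {selector} {matching_type} {assoc}"


def tlsa_record(port: int, proto: str, host: str, rdata: str, ttl: int = 3600) -> str:
    """Format a zone-file line for the service at host:port."""
    return f"_{port}._{proto}.{host}. {ttl} IN TLSA {rdata}"
```

Calling `tlsa_record(25, "tcp", "mx.example.com", tlsa_rdata(3, 1, 1, spki_der))` once for the current key and once for the next key would give the two lines shown on the slide.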

jcameron commented 4 years ago

Most of that we can do easily, except for the rolling update to the Let's Encrypt cert. This is kind of complex for any cert actually, as Virtualmin would need to publish the new cert's TLSA record in DNS for long enough to allow caches to expire before actually switching to the cert in Apache.

I wonder, can this problem instead be minimized by having a really short TTL on the TLSA records?

chris001 commented 4 years ago

> I wonder, can this problem instead be minimized by having a really short TTL on the TLSA records?

3600 seconds (1 hour) is a decently short TTL to have on the TLSA records. Then you only need to insert the new TLSA records, for the new (LetsEncrypt) TLS certs, 2 hours (2xTTL) before the old (LetsEncrypt) TLS certs expire. When the old (LetsEncrypt) TLS certs do expire, you can remove/prune the old TLSA records from DNS.
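
The 2xTTL arithmetic can be sketched as follows. `publish_deadline` is a hypothetical helper, and `switch_time` stands for whenever the new cert goes live (at the latest, the old cert's expiry).

```python
from datetime import datetime, timedelta


def publish_deadline(switch_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment to publish the new TLSA record: 2xTTL before the switch."""
    return switch_time - timedelta(seconds=2 * ttl_seconds)


# With a 3600 second TTL and a switch at 12:00, the record must be in DNS by 10:00.
deadline = publish_deadline(datetime(2020, 6, 1, 12, 0), 3600)
```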

jcameron commented 4 years ago

What if the TTL was only 60 seconds?

chris001 commented 4 years ago

> What if the TTL was only 60 seconds?

Good question. I think you'd experience rather heavy DNS server load. Every time a client wanted to use a secure service running on the Virtualmin server, its validation of the TLS cert served by Virtualmin would only last 60 seconds, meaning it could pretty much never rely on a cached answer. For example, a web or email client would have to re-validate that the TLSA hash matches the hash of the cert for every single web page a user browsed (assuming the user spends more than a minute reading each page), or every time the mail app checked for new email (assuming the user's mail client checks for mail every 2+ minutes).
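
For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes an idealized caching resolver that refetches a record at most once per TTL (no prefetching, no early cache eviction), so it gives an upper bound per resolver and per record rather than a measurement. `max_refetches_per_day` is a made-up name.

```python
def max_refetches_per_day(ttl_seconds: int) -> int:
    """Upper bound on upstream queries one caching resolver sends per record per day."""
    return 86400 // ttl_seconds


short_ttl = max_refetches_per_day(60)    # 1440 refetches per day at most
hour_ttl = max_refetches_per_day(3600)   # 24 refetches per day at most
```

Under that assumption a 60 second TTL costs up to 60 times as many lookups per busy resolver as a one hour TTL.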

chris001 commented 4 years ago

https://tools.ietf.org/html/rfc6698

A.4. Handling Certificate Rollover

Certificate rollover is handled in much the same way as for rolling DNSSEC zone signing keys using the pre-publish key rollover method [RFC4641]. Suppose example.com has a single TLSA record for a TLS service on TCP port 990:

_990._tcp.example.com IN TLSA 1 1 1 1CFC98A706BCF3683015...

To start the rollover process, obtain or generate the new certificate or SubjectPublicKeyInfo to be used after the rollover and generate the new TLSA record. Add that record alongside the old one:

_990._tcp.example.com IN TLSA 1 1 1 1CFC98A706BCF3683015...
_990._tcp.example.com IN TLSA 1 1 1 62D5414CD1CC657E3D30...

After the new records have propagated to the authoritative nameservers and the TTL of the old record has expired, switch to the new certificate on the TLS server. Once this has occurred, the old TLSA record can be removed:

_990._tcp.example.com IN TLSA 1 1 1 62D5414CD1CC657E3D30...

This completes the certificate rollover.
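
The ordering in the RFC text can be checked with a small model. This is only a sketch: the phase names and the `safe_for_cached_clients` helper are invented for illustration, and it assumes every phase lasts at least one TTL, so a resolver can hold at most the previous phase's records in cache.

```python
# Each phase: (name, TLSA records published in DNS, cert the TLS server presents)
RFC_ROLLOVER = [
    ("initial",     ("old",),       "old"),
    ("pre-publish", ("old", "new"), "old"),
    ("switch",      ("old", "new"), "new"),
    ("cleanup",     ("new",),       "new"),
]

# Swapping the cert and the TLSA record at the same moment, with no overlap.
IMMEDIATE_SWAP = [
    ("initial", ("old",), "old"),
    ("swap",    ("new",), "new"),
]


def safe_for_cached_clients(phases) -> bool:
    """True if the served cert always matches both the records published in
    the current phase and those of the previous phase (possibly still cached)."""
    for i, (_, published, served) in enumerate(phases):
        if served not in published:
            return False
        if i > 0 and served not in phases[i - 1][1]:
            return False
    return True
```

`safe_for_cached_clients(RFC_ROLLOVER)` holds, while `safe_for_cached_clients(IMMEDIATE_SWAP)` does not: a client with the old record cached would see the new cert and fail validation.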

jcameron commented 4 years ago

Got it - I see what needs to be done, it's just complex to implement given the way SSL cert renewals work currently.

chris001 commented 4 years ago

> Got it - I see what needs to be done, it's just complex to implement given the way SSL cert renewals work currently.

Yes, it seems extra metadata, such as creation date/time and/or expiration date/time, might need to be kept, so Virtualmin will know with certainty which records are pruning candidates. Maybe that extra metadata could be stored cleverly in the comment of the relevant DNS resource record? The old TLSA records don't need to be pruned exactly on time; it doesn't hurt anything if they linger longer than strictly needed. However, it's good to have the system management software be smart enough to know IF a TLSA record is old/eligible for pruning, so that WHEN the pruning script runs again, it can prune all the old records, leaving the DNS zone clean of old garbage and perfectly valid, with a 100% score on the DNS zone tester sites.
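
A minimal sketch of that pruning decision, assuming the expiry of the cert behind each record is tracked somewhere (zone comment, side file, wherever). The `TrackedTlsa` type and `prune_candidates` function are hypothetical and not part of Virtualmin.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TrackedTlsa:
    rdata: str                 # e.g. "3 1 1 <hex digest>"
    cert_not_after: datetime   # hypothetical metadata: expiry of the matching cert


def prune_candidates(records, served_rdata: str, now: datetime):
    """Records that are safe to drop: the cert behind them has expired
    and they do not match the cert currently being served."""
    return [
        r for r in records
        if r.rdata != served_rdata and r.cert_not_after <= now
    ]
```

A record for a pre-published next cert survives because its cert has not expired yet, and the record for the live cert is never returned, even if the expiry metadata is wrong.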

jcameron commented 4 years ago

The complexity comes from the SSL cert replacement process - instead of just immediately starting to use it, Virtualmin would need to keep it separately until all cached records have expired, and then apply the cert in the background.

chris001 commented 4 years ago

> The complexity comes from the SSL cert replacement process - instead of just immediately starting to use it, Virtualmin would need to keep it separately until all cached records have expired, and then apply the cert in the background.

Couldn't you request the new TLS certs (from LetsEncrypt), then, as soon as certbot successfully obtains them, immediately make the call to ldns-dane to generate the new TLSA record, install it into the DNS zone, then reload or restart bind9 and the relevant secure web service? There'd be a second or two when the new TLSA record doesn't exist in the zone yet and new secure connections fail to validate the new TLS cert against the old TLSA hash of the secure web service, but it'd be the best we could do, wouldn't it?

jcameron commented 4 years ago

But what about clients using the old cached DNS records who connect to the server and see the new cert, which doesn't match them?

chris001 commented 4 years ago

> But what about clients using the old cached DNS records who connect to the server and see the new cert, which doesn't match them?

WEB : If the https web browser client is a strict DANE-validating client, one that refuses to connect to secure web services with invalid or possibly forged TLSA hashes, then it may (and probably should) display an error to the user and refuse to connect to it. Retries will succeed after the TTL of the old expired TLSA records has passed, after which time they'll have aged out of the client's local DNS cache, and the DNS client must request the new TLSA records from the zone's authoritative DNS server.

In practice, none of the mainstream web browsers (Apple Safari, Firefox, Chrome, Chromium-based Edge, Opera, Linux IceWeasel, etc.) strictly validates and enforces that the DANE TLSA record hashes match the hash of the TLS certificate presented to the client by the web server. The browsers just do "classic" root CA TLS cert verification.

MAIL : Mail servers have more tools, in the form of plugins, for enforcing the rule that TLSA records must match the hash of the TLS cert being served live on the secure mail server port. Mail app users are more tolerant of waiting a TTL of time, as long as it's short, for example, 15 minutes, for the next secure mail checking TLS session, because their secure mail checking TLS sessions are usually 10x shorter than a 15 minute TTL, and occur in the background, so user experience can't be, and doesn't have to be, instant.

chris001 commented 4 years ago

@jcameron @swelljoe This Firefox browser addon (it might work on Chrome also) should help you more quickly troubleshoot whether the code is creating valid DNSSEC and DANE/TLSA DNS resource records.
It lets you refresh the web page, click on its orange padlock icon, and instantly see whether the DNSSEC and DANE/TLSA statuses are valid (green) or invalid (red). https://addons.mozilla.org/en-US/firefox/addon/httpspluschecker/

jcameron commented 4 years ago

So in that case, is there any benefit in keeping the DNS records for old cert around? It seems like it's only useful to have multiple if they are created before the new cert is installed.

chris001 commented 4 years ago

> So in that case, is there any benefit in keeping the DNS records for old cert around? It seems like it's only useful to have multiple if they are created before the new cert is installed.

It'd seem to be the case. The instant the services (dovecot, postfix, apache, nginx, openldap, etc.) start serving the new TLS cert, you can safely delete the DANE TLSA resource records containing the hash of the old TLS cert (and re-sign the zone), and nobody should encounter an error, because nobody should look for the old version of those DANE TLSA records, since the hash they show would fail to match the hash of the new TLS cert.
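
In other words, a re-sync only has to compare what is in the zone with what is actually being served, no matter who renewed the cert. A minimal sketch, with `tlsa_sync_plan` as a hypothetical helper working on plain rdata strings:

```python
def tlsa_sync_plan(published: set, served: set):
    """Return (to_add, to_remove) so the zone matches the certs in use."""
    to_add = sorted(served - published)      # served certs with no TLSA record yet
    to_remove = sorted(published - served)   # records matching no served cert
    return to_add, to_remove


# One stale record in the zone, one new cert on the wire.
plan = tlsa_sync_plan({"3 1 1 aaaa"}, {"3 1 1 bbbb"})
```

This does not help resolvers that still hold the old records in cache; they keep failing until the old TTL runs out.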

chris001 commented 4 years ago

Update. Virtualmin is still failing to update/insert the new valid TLSA (aka TLS Authentication) records into DNS. Note: Virtualmin absolutely must be robust enough to not assume that the only time a Let's Encrypt cert is renewed is when Virtualmin renews it. This server admin, and probably many others, is renewing its own LetsEncrypt certs by cron job, because the Virtualmin software refuses to renew them, because they came into being on the server before Virtualmin added the LetsEncrypt feature...

jcameron commented 4 years ago

@chris001 if you are updating cert files outside of Virtualmin, you can force a re-sync of the TLSA records by running:

virtualmin modify-web --domain example.com --sync-tlsa

This can be run for all domains with:

virtualmin modify-web --all-domains --sync-tlsa

However, I'd recommend running it only for the renewed domain if you can.
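
If the certs are renewed by certbot from cron, one way to run this per renewed domain is from a deploy hook. The sketch below only builds the command lines. It relies on certbot passing the renewed names to deploy hooks in the space-separated `RENEWED_DOMAINS` environment variable, and it assumes each name is also a Virtualmin virtual server, which will not hold for names like `www.example.com`. `sync_tlsa_commands` is a made-up helper.

```python
def sync_tlsa_commands(renewed_domains: str):
    """Build one `virtualmin modify-web --sync-tlsa` command per renewed name.

    `renewed_domains` is a space-separated list, as certbot provides in
    RENEWED_DOMAINS. Assumption: every name maps to a Virtualmin domain.
    """
    return [
        ["virtualmin", "modify-web", "--domain", name, "--sync-tlsa"]
        for name in renewed_domains.split()
    ]
```

A hook script would read `os.environ.get("RENEWED_DOMAINS", "")`, pass it to this function, and hand each resulting list to `subprocess.run(cmd, check=True)`.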

chris001 commented 4 years ago

Having a hard time testing it! After running the command line, the TLSA record test site https://www.huque.com/bin/danecheck gives a White Screen Of Death on the virtualmin-hosted domains! Haven't checked deeper into it yet, just thought I'd share this result!

jcameron commented 4 years ago

Do the updated TLSA records look OK?

chris001 commented 4 years ago

The command line gives an error:

Virtual server myvirtualdomain.com does not have a web site enabled

When checking in Virtualmin, under myvirtualserver.com, Edit Virtual Server, Enabled Features, both "Nginx website enabled" and "Nginx SSL website" are OFF (unchecked). I had to disable them in order to manually get PHP-FPM working on Nginx, because the config generator had been breaking the Nginx PHP-FPM config for all the virtual servers' Nginx websites.

jcameron commented 4 years ago

Oh ... so this domain has TLSA records, but not a website (according to Virtualmin)? That's not a setup we support currently, sorry.

virtualmin / virtualmin-gpl

DANE TLSA record misconfiguration (not updating TLSA hashes when new LetsEncrypt certs appear). #145