virtualmin / virtualmin-gpl

Virtualmin web hosting control panel for Webmin
https://www.virtualmin.com
GNU General Public License v3.0

TLSA notification received from DANE Survey bot. #414

Open chris001 opened 2 years ago

chris001 commented 2 years ago

Note: This bug would probably be fixed, and the DANE Survey bot would rate the TLSA records as correct, if Virtualmin were to configure BIND to do fully automatic DNSSEC record generation, record validation, key management and key rollover with respect to the TTLs of DNSSEC records, as mentioned in the comments on #410.

From | DANE Survey Notices <dane-survey-notices@dnssec-stats.ant.isi.edu>
Cc | postmaster@mydomain.com , webmaster@mydomain.com
Date | 04/07/2022 20:28
Subject | mail.mydomain.com: SMTP server DNS (DANE TLSA record) issue

About the DANE Survey:  https://stats.dnssec-tools.org/about.html
DANE Survey Statistics: https://stats.dnssec-tools.org/

[ This is becoming a regular event.  Any chance the underlying process can be
  fundamentally fixed and monitored to avoid future intermittent outages? ]

The TLSA RRsets of some of your email servers do not match their actual
certificate chains.  Issue details for the affected domains:

    mydomain.com

can be seen at:

    https://stats.dnssec-tools.org/explore/?mydomain.com

The issues can be resolved by removing or updating the associated DNS
DANE TLSA records.

    https://letsdns.org/
    https://raf.org/danectl/
    https://community.letsencrypt.org/t/please-avoid-3-0-1-and-3-0-2-dane-tlsa-records-with-le-certificates/7022/17
    https://mail.sys4.de/pipermail/dane-users/2018-February/000440.html

-- 
        Viktor.

Details web page:

The DANE TLSA records of these MX hosts fail to validate their certificate chains. Inbound email may be delayed or not delivered.

The TLSA records fail to validate the certificate chain at one or more IP addresses. Inbound email may be delayed or not delivered.

jcameron commented 2 years ago

So is this a separate bug from https://github.com/virtualmin/virtualmin-gpl/issues/410 , or just a side-effect of DNSSEC signing not happening?

chris001 commented 2 years ago

They're very closely related. As you can see, these records are getting validated by servers on the internet.

This one is about TLSA records going stale and failing validation because automatically renewed Let's Encrypt certs do not trigger a refresh of dependent DNS records such as TLSA. A TLSA record contains a hash of the domain's TLS cert, which tells TLS clients what cert to expect on that particular port of that domain.
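To make that dependency concrete, here is a minimal Python sketch of how the data for a "3 0 1" TLSA record (the DANE-EE / Cert / SHA2-256 combination that appears in the survey output quoted in this thread) can be derived from a certificate. It assumes the certificate is available as PEM text; the function names are illustrative and are not part of Virtualmin.

```python
import hashlib
import ssl


def tlsa_3_0_1(pem_cert: str) -> str:
    """Hex SHA-256 digest of the full DER-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def tlsa_record(host: str, port: int, pem_cert: str, proto: str = "tcp") -> str:
    """Zone-file style line for the given service, e.g. port 25 on a mail host."""
    return f"_{port}._{proto}.{host}. IN TLSA 3 0 1 {tlsa_3_0_1(pem_cert)}"
```

Because the digest covers the whole certificate, every renewal yields a different value, so a record of this kind has to be regenerated whenever the certificate changes.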

The best solution, for recent enough versions of BIND, might be to enable the newest config settings for BIND to intelligently and automatically detect relevant changes/expiration, and regenerate all stale DNSSEC related DNS records, including TLSA. I listed the new BIND settings here.

chris001 commented 2 years ago

Follow-up: DNSSEC/TLSA-validating mail servers on the internet (all the top mail services that treat security reputation as a priority) are refusing to deliver mail to this Virtualmin server because of its invalid TLSA records. (Spam is getting delivered fine; spam senders don't bother much with validation.)

The TLSA records became invalid because the Let's Encrypt TLS cert renewed and nobody regenerated the TLSA records to match the new LE cert. BIND has a config setting to automatically regenerate DNSSEC TLSA; however, Virtualmin isn't applying those BIND config settings.

Error:

This is the mail system at Protonmail.

Your message could not be delivered for more than 12 hour(s).
It will be retried until it is 2 days old.

<_______@________.com>  Server certificate not verified.

Diagnostic-Code: X-Postfix: Server certificate not verified

chris001 commented 2 years ago

See also #115 and #145

chris001 commented 2 years ago

I guess Virtualmin should run this command every time it installs new certs, or just configure BIND to automatically do full DNSSEC maintenance with the config settings and BIND policies.

sudo /usr/sbin/virtualmin modify-dns --all-domains --sync-tlsa
sudo /usr/sbin/rndc reload

Edited, need to reload records.

chris001 commented 2 years ago

The following SHOULD work for testing. It upgrades the BIND settings to use the newest features, which let BIND handle KSK and ZSK refreshing/rotation itself, so that expired keys are not used by caching DNS servers beyond their TTL. Otherwise DNSSEC fails for your domain, and validating clients cannot reach it, because the key they retrieved from their caching resolver no longer matches the new key recently published on your domain's authoritative name server:

BIND configuration, change this setting from: auto-dnssec maintain; to dnssec-policy alg13-ksk-unlimited-zsk-60day;

BIND policy file:

dnssec-policy alg13-ksk-unlimited-zsk-60day {
     keys {
         ksk key-directory lifetime unlimited algorithm ECDSAP256SHA256;
         zsk key-directory lifetime P60D algorithm ECDSAP256SHA256;
     };
};

jcameron commented 2 years ago

So, Virtualmin should already be running the equivalent of modify-dns --sync-tlsa after renewing a Let's Encrypt cert, or in fact any time an SSL cert is updated. I wonder, is it perhaps just not restarting BIND properly?

chris001 commented 2 years ago

Yes, you do need to reload the DNS zone records with that rndc reload. If not, BIND continues to serve the stale, invalid TLSA records until their TTL expires, which could be days or weeks. During that time, every mail server that validates DNSSEC DANE TLSA (to defend against MITM attacks) will refuse to connect to your mail server, because the cert it receives does not match the cert in the zone's TLSA record for that service port.

jcameron commented 2 years ago

Odd that Virtualmin isn't doing this automatically, because it should be already sending the reload signal to BIND after any records are updated.

If you use the DNS Records page in the Virtualmin UI to add a test record to a domain, can it be queried immediately?

chris001 commented 2 years ago

I just added a new record through the DNS Records page in the Virtualmin UI, and you're right, it can be queried immediately. However, after running the modify-dns --sync-tlsa command, the TLSA records remained invalid for some time. The new valid TLSA records probably started to appear, and DANE-validated MTA connections started working, once the zone's default TTL of 3600 seconds (one hour) had passed: at that point the cached TLSA records expired on the internet's caching DNS resolvers, so they loaded the new TLSA records from this authoritative server.

jcameron commented 2 years ago

Oh yeah, the TTL could cause this. Are the records still out of date if you query the nameserver on your Virtualmin system directly?

chris001 commented 2 years ago

The records are up to date when you query the Virtualmin system directly, and also when you query some internet resolvers such as CF 1.0.0.1, which apparently doesn't cache for the zone TTL (?). However, you get stale records for maybe a minute when you query the popular 8.8.8.8. Test it yourself: make a new DNS AAAA record and nslookup it locally, on CF, and on Google. Then modify the address in the same AAAA record and look it up again: locally and on CF it should be the new one, while Google still returns the old one, just for a short time.

chris001 commented 2 years ago

This shell script might be good to look at, see if there is any unknown gap in the existing methods.

jcameron commented 2 years ago

Ok .. that seems like expected behavior then. DNS isn't guaranteed to update immediately across all resolvers and caches.

chris001 commented 2 years ago

This DANE TLSA mailing list message describes a practical strategy to prevent cached, outdated TLSA records from failing validation on connections.

This danectl library implements this strategy for you:

Subject: A sensible "3 1 1" + "3 1 1" key rotation approach
From: Viktor Dukhovni ietf-dane at dukhovni.org
Date: Fri Feb 16 21:05:20 CET 2018

To avoid (even temporary) mismatches always publish multiple (two
are enough) TLSA records.  One for matching the current certificate
chain, and another matching the *future* certificate chain.

You might ask how the *future* certificate chain can be predicted,
but the answer is simple enough.  While you may not know all the
certificate details, you can control the public key that goes into
the future certificate.  This can be matched with a "3 1 1" record.

Therefore, the recommended key rotation approach is:

  1.   Whenever you *deploy* a certificate chain with a new key,
       at the same time (that way you won't forget later) generate
       the next key!  And at that time update your TLSA records to match
       both keys.  You only need the next public key for this, the
       next private key could be password protected if you like, but
       for most sites just rotating the keys often and letting the OS
       protect the keys from all but the authorized account is enough.

       _25._tcp.smtp.example.com. IN TLSA 3 1 1 <sha256(curr key)>
       _25._tcp.smtp.example.com. IN TLSA 3 1 1 <sha256(next key)>

  2.   When it is time to obtain a new certificate, generate the CSR
       from the previously generated *next* key (this may require
       decryption of the private key if stored encrypted initially) and
       request the certificate, *but* do that only if the corresponding
       TLSA record is already published!  Do the DNS lookup to verify.

  3.   To deploy the newly obtained certificate go back to step 1.

-- 
    Viktor.
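
The rotation steps in the message above could be sketched roughly as follows. This is an illustrative Python sketch, not danectl's or Virtualmin's implementation. It assumes the DER-encoded SubjectPublicKeyInfo of each key has already been extracted (for example with openssl pkey -pubout -outform DER), and that the currently published digests come from a DNS lookup that is not shown here. All function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def tlsa_3_1_1(spki_der: bytes) -> str:
    """Association data for a "3 1 1" record: SHA-256 of the public key (SPKI)."""
    return hashlib.sha256(spki_der).hexdigest()


def rotation_rrset(owner: str, current_spki: bytes, next_spki: bytes) -> List[str]:
    """Step 1: publish one record for the current key and one for the next key."""
    return [
        f"{owner} IN TLSA 3 1 1 {tlsa_3_1_1(current_spki)}",
        f"{owner} IN TLSA 3 1 1 {tlsa_3_1_1(next_spki)}",
    ]


def may_request_certificate(published_digests: Iterable[str], next_spki: bytes) -> bool:
    """Step 2: request the new cert only once the next key's record is visible in DNS."""
    wanted = tlsa_3_1_1(next_spki)
    return wanted in {digest.lower() for digest in published_digests}
```

Step 3 then repeats step 1 with the old "next" key as the new "current" key and a freshly generated key as the new "next" key.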
jcameron commented 2 years ago

Interesting approach, but would that even be possible when using Let's Encrypt, as the certbot tool generates the private key at the time the cert is requested? The only solution I can see that might work is to not install the new cert right away, and instead generate TLSA records from it, wait for them to propagate via DNS, then install the cert. But that adds a fair bit of complexity to the whole process ...
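
A rough sketch of the timing behind that "publish, wait, then install" idea, assuming the wait is simply the TTL of the previously served TLSA RRset plus an optional safety margin. The function names and the margin parameter are hypothetical and only illustrate the delay, not how Virtualmin's renewal code is structured.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_install(published_at: datetime, old_ttl_seconds: int,
                          margin_seconds: int = 0) -> datetime:
    """Resolvers that cached the old RRset just before the update may keep it
    for up to its TTL, so wait at least that long before switching certs."""
    return published_at + timedelta(seconds=old_ttl_seconds + margin_seconds)


def can_install(now: datetime, published_at: datetime, old_ttl_seconds: int,
                margin_seconds: int = 0) -> bool:
    """True once the new TLSA records have had time to replace cached copies."""
    return now >= earliest_safe_install(published_at, old_ttl_seconds, margin_seconds)
```

During the wait, the records for the new cert would have to be published alongside the old ones, since the old cert is still the one being served.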

jcameron commented 2 years ago

Alternately, could this be solved by just setting a really low TTL on TLSA records?

chris001 commented 2 years ago

The tradeoff of a lower TTL is a higher BIND query load on the authoritative DNS zone hosted by Virtualmin. The more popular your hosted sites become, the more validating clients have to look up those uncached TLSA records, and this can grow into something resembling a DNS DDoS storm. First, we should investigate why the default zone TTL of 3600 seconds was apparently ignored by the CF and Google public resolvers, at least for AAAA records. We should test whether they also ignore the TTL for TLSA records when those are refreshed after a Let's Encrypt cert renewal, and survey what these libraries do about TLSA TTLs on their DNSSEC-secured zones.

jcameron commented 2 years ago

3600 seconds is pretty long though, even if it's not ignored. That could mean a full hour in which the SSL cert appears to be invalid for some clients, if they are still using the old TLSA records?

chris001 commented 2 years ago

Agreed, cached stale records out there are no bueno. However, we must be able to reproduce this reliably before designing a solution. My recent test was only on AAAA records, which could have a different cache policy than the more consequential TLSA records. To trigger this bug again for accurate observation, is there a standard command line that makes Virtualmin force an earlier-than-scheduled LE cert renewal on a specified domain?

jcameron commented 2 years ago

Yeah, you can force a cert renewal by running virtualmin generate-letsencrypt-cert --domain yourdomain.com

chris001 commented 2 years ago

I did some research and ran tests on TLSA records expiring, just to see how it breaks and triggers a notification from the DANE TLSA survey bot. Now I know why the only reasonable working solution they recommend is to publish two TLSA records per port/protocol/fully qualified hostname combination: one for the Current TLSA and one for the Next TLSA. When the Current one expires (forced by an early cert renewal, or because the cert really reached or neared its expiration date), the switchover to the new TLSA is smooth with zero downtime, because the new TLSA is already in the DNS zone. Your users are spared the time it takes brand new, never-before-seen TLSA records to propagate from your master zone to all caching DNS resolvers on the internet.

Some caching DNS resolvers begin to return the new TLSA records within seconds, and some take longer than the TTL! So the "Reduce TTL method" is unreliable and should not be used to make the internet's caching DNS resolvers expire old TLSA records sooner. In practice it is too sluggish compared to the Current and Next TLSA records method.
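
The difference between a single record and the Current plus Next pair can be illustrated with a toy model of a caching resolver. This is a simplified sketch under stated assumptions: the resolver honours the TTL exactly, the digests are stand-in strings rather than real hashes, and the class and function names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple


class CachingResolver:
    """Toy resolver that caches an RRset until its TTL has elapsed."""

    def __init__(self, ttl: int) -> None:
        self.ttl = ttl
        self._cache: Dict[str, Tuple[FrozenSet[str], int]] = {}

    def query(self, name: str, zone: Dict[str, FrozenSet[str]], now: int) -> FrozenSet[str]:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1] + self.ttl:
            return cached[0]          # still within the TTL: answer from cache
        rrset = zone[name]            # missing or expired: ask the authoritative zone
        self._cache[name] = (rrset, now)
        return rrset


def simulate(before: Iterable[str], after: Iterable[str], new_digest: str,
             ttl: int = 3600) -> Tuple[bool, bool]:
    """Return (validates shortly after renewal, validates once the cache expired)."""
    name = "_25._tcp.mail.example.com."
    zone = {name: frozenset(before)}
    resolver = CachingResolver(ttl)
    resolver.query(name, zone, now=0)      # resolver caches the pre-renewal RRset
    zone[name] = frozenset(after)          # cert renewed, zone updated right away
    soon = new_digest in resolver.query(name, zone, now=200)
    later = new_digest in resolver.query(name, zone, now=ttl)
    return soon, later
```

In this model, simulate({"cur"}, {"next"}, "next") fails until the cached RRset expires, while simulate({"cur", "next"}, {"next", "after-next"}, "next") validates throughout, because the cached RRset already contains the record for the next key.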

jcameron commented 2 years ago

Thanks for the research! It's a pity the option to add the TLSA records in advance is so complex to implement in practice..

chris001 commented 2 years ago

Agreed, it's a pity, however it's because users sometimes need/want to renew their certs with no delay, so there's no time to reduce TTL on the zone and let the old TTL pass before the zone has propagated to the world. This Current and Next TLSA record approach was brought up and agreed by IETF in 2011. Luckily there are two open source libraries which help with these Current and Next TLSA records.

  1. danectl
  2. LetsDNS

chris001 commented 2 years ago

Do you want these steps to reproduce and verify this invalid expired TLSA record?

jcameron commented 2 years ago

I understand the process, the tricky part would be re-designing how Virtualmin does SSL cert requests to add a delay between when the cert is generated, and when it is installed. Currently all the cert renewal code does both in a single operation, which isn't easy to split up.

Maybe the only practical work-around is to set a lower TTL on the TLSA records by default?

chris001 commented 2 years ago

I guess that might be a workaround until the libraries that maintain two TLSA records (current and next key) can be added. How long are most website visits... 5min? 10min?

chris001 commented 2 years ago

The same notification email from the DANE TLSA survey bot has arrived again!

The likely explanation: failed/mismatched/expired TLSA records being served by BIND9 DNS, 1 day after the Let's Encrypt TLS cert renewed.

https://stats.dnssec-tools.org/explore/?virtualserver.tld

Issues found with the virtualserver.tld domain:

The DANE TLSA records of these MX hosts fail to validate their certificate chains. Inbound email may be delayed or not delivered.
mail.virtualserver.tld  xxx.xxx.xxx.xxx     no TLSA record match
mail.virtualserver.tld  2xxx:xxxx:xxxx:xxxx::xxxx   no TLSA record match 

 mail.virtualserver.tld
The TLSA records fail to validate the certificate chain at one or more IP addresses. Inbound email may be delayed or not delivered.
DANE-EE (3) Cert (0)    SHA2-256 (1)    cf05ae132fc7c21dc9b2675ed093a862ac65e919d0ec0127ba3496439f724bbe

https://dane.sys4.de/smtp/virtualserver.tld

SMTP (fail icon)

The domain lists the following MX entries:

5 mail.virtualserver.tld

    IP Addresses
    xxx.xxx.xxx.xxx
    2xxx:xxxx:xxxx:xxxx:0:0:0:xxxx

    Usable TLSA Records
    3, 0, 1 cf05ae132fc7c21d[...]ba3496439f724bbe - certificate not trusted: (27) - certificate not trusted: (27)

chris001 commented 1 year ago

Here's a simple workaround to get no downtime on the TLSA records with LetsEncrypt certbot because, as you said @jcameron , you can't really get the "future next" key pair from LetsEncrypt, at least not now.

jcameron commented 1 year ago

So the fix is just to use certbot renew --reuse-key ? That seems like something we could do pretty easily ...
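
One caveat worth checking with --reuse-key: reusing the key keeps the public key stable, but the renewed certificate itself is still a new object. The toy sketch below (stand-in bytes rather than real DER, hypothetical function names) illustrates why, under the TLSA selector definitions, a selector 1 record survives such a renewal, while a selector 0 record, like the "3 0 1" records quoted in the diagnostics earlier in this thread, would still change.

```python
import hashlib


def toy_certificate(serial: int, spki: bytes) -> bytes:
    """Hypothetical stand-in for a DER cert: renewal changes the serial, not the key."""
    return b"serial=" + str(serial).encode() + b";spki=" + spki


def association_data(selector: int, certificate: bytes, spki: bytes) -> str:
    """SHA-256 (matching type 1) over the full cert (selector 0) or the public key (selector 1)."""
    if selector == 0:
        return hashlib.sha256(certificate).hexdigest()
    if selector == 1:
        return hashlib.sha256(spki).hexdigest()
    raise ValueError("selector must be 0 or 1")


# Same key, two successive certificates, as with certbot renew --reuse-key.
key = b"reused-public-key"
old_cert = toy_certificate(1, key)
new_cert = toy_certificate(2, key)
selector0_changed = association_data(0, old_cert, key) != association_data(0, new_cert, key)
selector1_changed = association_data(1, old_cert, key) != association_data(1, new_cert, key)
```

Here selector0_changed is True and selector1_changed is False, so key reuse on its own would only keep records stable if they are published with selector 1.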

chris001 commented 1 year ago

Yes, because certbot doesn't have a way for you to create a "future key"... the way certbot currently works, any key you create is the current key. 😞

jcameron commented 1 year ago

Ok, I'll look into this ...

jcameron commented 1 year ago

Ok, the next release of Webmin will re-use existing Let's Encrypt keys by default..

shoulders commented 6 months ago

@chris001 Did this get fixed?

chris001 commented 6 months ago

@shoulders

Close but not 100%.


The domain lists the following MX entries:

5 mail.virtualsubserver.tld
DNSSEC ✔️ TLSA ✔️ SMTP ❌ All TLSA RRs failed. (See details.)

    IP Addresses
    aaa.bb.ccc.ddd
    AAAA:BBB:CCCC:DDD:E:F:G:H

    Usable TLSA Records
    3, 0, 1 1dc4e0cbfab83ac0[...]cdaf9998b83a9c34 - certificate not trusted: (27) - certificate not trusted: (27)


jcameron commented 6 months ago

Looks like this will be fixed by the change suggested here : https://github.com/virtualmin/virtualmin-gpl/issues/803