Open chris001 opened 2 years ago
So is this a separate bug from https://github.com/virtualmin/virtualmin-gpl/issues/410 , or just a side-effect of DNSSEC signing not happening?
They're very closely related. As you can see, these records are getting validated by servers on the internet.
This one is about TLSA records going stale and failing validation because auto-renewed Let's Encrypt certs do not trigger a refresh of dependent DNS records such as TLSA. A TLSA record contains a hash of the domain's TLS cert, which tells TLS clients what cert to expect on that particular port of that domain.
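For illustration, the digest in a "3 0 1" record (the kind the survey output later in this thread reports) is the SHA-256 of the whole DER-encoded certificate. A minimal Python sketch using only the standard library; the function names and the owner name are made up for this example and are not part of Virtualmin:

```python
import hashlib
import ssl


def tlsa_301_data(cert_pem: str) -> str:
    """Return the hex digest for a "3 0 1" TLSA record.

    Usage 3 (DANE-EE), selector 0 (full certificate), matching type 1
    (SHA-256): the digest covers the complete DER-encoded certificate.
    """
    cert_der = ssl.PEM_cert_to_DER_cert(cert_pem)
    return hashlib.sha256(cert_der).hexdigest()


def tlsa_301_record(owner: str, cert_pem: str) -> str:
    """Format a zone-file style line for the given owner name (hypothetical helper)."""
    return f"{owner} IN TLSA 3 0 1 {tlsa_301_data(cert_pem)}"
```

Because the digest covers the whole certificate, any renewal produces a different value, which is why the published record has to be regenerated each time the cert changes.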
The best solution, for recent enough versions of BIND, might be to enable the newest BIND config settings so that it intelligently and automatically detects relevant changes/expiration and regenerates all stale DNSSEC-related DNS records, including TLSA. I listed the new BIND settings here.
Follow up. DNSSEC/TLSA-validating mail servers on the internet (all the top mail services that treat their security reputation as a priority) are refusing to deliver mail to this Virtualmin server because its DNSSEC TLSA DNS records are invalid. (Spam is getting delivered fine; spam senders don't care much about validating.)
The TLSA records became invalid because the Let's Encrypt TLS cert renewed, and nobody regenerated the TLSA records to match the new LE cert. BIND has a config setting to automatically regenerate DNSSEC TLSA records; however, Virtualmin isn't applying those BIND config settings.
Error:
This is the mail system at Protonmail.
Your message could not be delivered for more than 12 hour(s).
It will be retried until it is 2 days old.
<_______@________.com> Server certificate not verified.
Diagnostic-Code: X-Postfix: Server certificate not verified
See also #115 and #145
I guess Virtualmin should run this command every time it installs new certs, or just configure BIND to automatically do full DNSSEC maintenance with the config settings and BIND policies.
sudo /usr/sbin/virtualmin modify-dns --all-domains --sync-tlsa
sudo /usr/sbin/rndc reload
Edited, need to reload records.
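As a rough sketch of how those two commands could be chained after a cert install, here is a small Python wrapper. The command paths are the ones quoted above; the wrapper itself, its function names, and the idea of calling it from a renewal hook are assumptions made for this example, not something Virtualmin ships. It needs to run as root.

```python
import subprocess
from typing import Callable, Sequence

# Commands quoted from the comment above; paths may differ on other systems.
SYNC_COMMANDS = (
    ("/usr/sbin/virtualmin", "modify-dns", "--all-domains", "--sync-tlsa"),
    ("/usr/sbin/rndc", "reload"),
)


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit status."""
    return subprocess.run(list(argv), check=False).returncode


def sync_tlsa_and_reload(run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Run the commands in order; stop and return False at the first failure."""
    for argv in SYNC_COMMANDS:
        if run(argv) != 0:
            return False
    return True
```

The `run` parameter only exists so the sequence can be exercised without touching a real server.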
The following SHOULD work for testing. It upgrades the BIND settings to use the newest features, which let BIND handle KSK and ZSK key refreshing/rotation. That way, caching DNS servers on the internet won't keep using expired keys beyond their TTL, which would otherwise cause DNSSEC failures on your domain: DNSSEC-validating clients would be unable to access your domain because the cached key they retrieved from their caching DNS resolver no longer matches the new key recently published on your domain's authoritative name server:
BIND configuration, change this setting from:
auto-dnssec maintain;
to
dnssec-policy alg13-ksk-unlimited-zsk-60day;
BIND policy file:
dnssec-policy alg13-ksk-unlimited-zsk-60day {
    keys {
        ksk key-directory lifetime unlimited algorithm ECDSAP256SHA256;
        zsk key-directory lifetime P60D algorithm ECDSAP256SHA256;
    };
};
So, Virtualmin should already be running the equivalent of modify-dns --sync-tlsa after renewing a Let's Encrypt cert, or in fact any time an SSL cert is updated. I wonder, is it perhaps just not restarting BIND properly?
Yes, you do need to reload the DNS zone records with that rndc reload. If not, BIND continues to serve the stale, invalid TLSA records until their TTL expires, which could be days or weeks. During that time, every fully secure (anti-MITM) DNSSEC/DANE/TLSA-validating mail server will refuse to connect to your mail server, because the cert it receives does not match the cert for that service port in the zone's TLSA record.
Odd that Virtualmin isn't doing this automatically, because it should be already sending the reload signal to BIND after any records are updated.
If you use the DNS Records page in the Virtualmin UI to add a test record to a domain, can it be queried immediately?
Just added new record thru DNS Records page in Virtualmin UI, and you're right it queries immediately.
However, after running the modify-dns --sync-tlsa command, the TLSA records remained invalid for some time. The new valid TLSA records probably started to appear, and DANE-validated MTA mail connections started working, because the zone's default TTL of 3600 seconds (one hour) had passed. After that time, the cached TLSA records expired on the internet's caching DNS resolvers, so they loaded the new TLSA records from this authoritative server.
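To make the one-hour window concrete: a resolver that cached the old record just before the change can keep answering with it until the change time plus the TTL. A small Python sketch of that upper bound, assuming resolvers honour the TTL exactly (the timestamp and function names are made up for the example):

```python
from datetime import datetime, timedelta, timezone


def stale_until(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a TTL-respecting cache can still return the old record."""
    return changed_at + timedelta(seconds=ttl_seconds)


def may_be_stale(now: datetime, changed_at: datetime, ttl_seconds: int) -> bool:
    """True while some cache could still legitimately hold the old record."""
    return changed_at <= now < stale_until(changed_at, ttl_seconds)


# Hypothetical time at which the TLSA records were regenerated.
changed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(stale_until(changed, 3600).isoformat())  # 2024-01-01T13:00:00+00:00
```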
Oh yeah, the TTL could cause this. Are the records still out of date if you query the nameserver on your Virtualmin system directly?
The records are up to date when you query the Virtualmin system directly, and when you query some internet resolvers such as CF 1.0.0.1, which apparently doesn't cache for the full zone TTL (?). However, you get stale records for maybe a minute when you query the popular 8.8.8.8.
Test it yourself: make a new DNS AAAA address record and nslookup it locally, on CF, and on Google. Then modify the same AAAA record's address and look it up again. Locally and on CF it should be the new one, while Google still returns the old one, just for a short time.
This shell script might be good to look at, see if there is any unknown gap in the existing methods.
Ok .. that seems like expected behavior then. DNS isn't guaranteed to update immediately across all resolvers and caches.
This DANE TLSA mailing list message has a practical strategy to prevent these cached, expired TLSA records from breaking validating connections.
This danectl library implements this strategy for you:
Subject: A sensible "3 1 1" + "3 1 1" key rotation approach
From: Viktor Dukhovni ietf-dane at dukhovni.org
Date: Fri Feb 16 21:05:20 CET 2018
To avoid (even temporary) mismatches always publish multiple (two
are enough) TLSA records. One for matching the current certificate
chain, and another matching the *future* certificate chain.
You might ask how the *future* certificate chain can be predicted,
but the answer is simple enough. While you may not know all the
certificate details, you can control the public key that goes into
the future certificate. This can be matched with a "3 1 1" record.
Therefore, the recommended key rotation approach is:
1. Whenever you *deploy* a certificate chain with a new key,
at the same time (that way you won't forget later) generate
the next key! And that time update your TLSA records to match
both keys. You only need the next public key for this, the
next private key could be password protected if you like, but
for most sites just rotating the keys often and letting the OS
protect the keys from all but the authorized account is enough.
_25._tcp.smtp.example.com. IN TLSA 3 1 1 <sha256(curr key)>
_25._tcp.smtp.example.com. IN TLSA 3 1 1 <sha256(next key)>
2. When it is time to obtain a new certificate, generate the CSR
from the previously generated *next* key (this may require
decryption of the private if stored encrypted initially) and
request the certificate, *but* do that only if the corresponding
TLSA record is already published! Do the DNS lookup to verify.
3. To deploy the newly obtained certificate go back to step 1.
--
Viktor.
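The steps in that message can be sketched in Python as follows. This is only an illustration under stated assumptions: the inputs are the DER-encoded SubjectPublicKeyInfo bytes of the current and next public keys (obtaining them, and the DNS lookup from step 2, are left to the caller), and all function names are invented for the example.

```python
import hashlib
from typing import Iterable, List


def tlsa_311_data(spki_der: bytes) -> str:
    """Hex digest for a "3 1 1" record: SHA-256 of the public key (SPKI) only."""
    return hashlib.sha256(spki_der).hexdigest()


def rollover_records(owner: str, current_spki_der: bytes, next_spki_der: bytes) -> List[str]:
    """Step 1: publish one record for the current key and one for the next key."""
    return [
        f"{owner} IN TLSA 3 1 1 {tlsa_311_data(current_spki_der)}",
        f"{owner} IN TLSA 3 1 1 {tlsa_311_data(next_spki_der)}",
    ]


def safe_to_request_certificate(next_spki_der: bytes, published_digests: Iterable[str]) -> bool:
    """Step 2: only request the new cert if the next key's record is already in DNS.

    published_digests is whatever a real DNS lookup returned for the owner name.
    """
    wanted = tlsa_311_data(next_spki_der)
    return wanted in {d.lower() for d in published_digests}
```

Since a "3 1 1" digest depends only on the public key, it does not change when a certificate is reissued for the same key.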
Interesting approach, but would that even be possible when using Let's Encrypt, as the certbot tool generates the private key at the time the cert is requested? The only solution I can see that might work is to not install the new cert right away, and instead generate TLSA records from it, wait for them to propagate via DNS, then install the cert. But that adds a fair bit of complexity to the whole process ...
Alternately, could this be solved by just setting a really low TTL on TLSA records?
The tradeoff of a lower TTL would be a higher BIND DNS query load on the Virtualmin-hosted authoritative DNS zone. The more popular your hosted DNS sites become, the more often the growing number of validating clients has to look up those uncached TLSA records, which can grow into something resembling a DNS DDoS attack storm.
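As a back-of-the-envelope illustration of that tradeoff, assume every caching resolver honours the TTL and re-queries as soon as its cache entry expires. That is a simplification, and the resolver count below is a made-up number, but it shows the load scaling inversely with the TTL:

```python
def max_queries_per_second(caching_resolvers: int, ttl_seconds: int, records: int = 1) -> float:
    """Rough upper bound on authoritative query rate under the stated assumption.

    Each resolver refreshes each record at most once per TTL.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return caching_resolvers * records / ttl_seconds


# Hypothetical population of 36,000 resolvers asking for one TLSA record.
print(max_queries_per_second(36_000, 3600))  # 10.0
print(max_queries_per_second(36_000, 60))    # 600.0
```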
First, we should investigate why the default zone TTL of 3600 seconds was apparently ignored by the CF and Google public resolvers, at least for AAAA records. We should test whether they also ignore the TTL for TLSA records when those are refreshed after a Let's Encrypt TLS cert renewal, and take a poll of what these libraries do with TLSA TTL timeouts on their secured DNSSEC zones.
3600 seconds is pretty long though, even if it's not ignored. That could mean a full hour in which the SSL cert appears to be invalid for some clients, if they are still using the old TLSA records?
Agreed, cached stale records out there are no bueno. However, we must reproduce this reliably and repeatedly before designing a solution. My recent test was only on AAAA records, which could have a different cache policy than the influential TLSA records. To force this bug again for accurate observation, is there a standard command line to have Virtualmin force an earlier-than-scheduled LE cert renewal on a specified domain?
Yeah, you can force a cert renewal by running virtualmin generate-letsencrypt-cert --domain yourdomain.com
I did research and ran tests on TLSA records expiring, just to see how it broke and whether it would trigger a notification from the DANE TLSA survey bot. Now I know why the best and only reasonable working solution they recommend is to publish two TLSA records per port/protocol/fully qualified hostname combination: one for the Current TLSA and one for the Next TLSA. When the Current one expires (forced expiration with an early cert renewal, or the cert truly ran to or near its expiration date), the switchover to the new TLSA is smooth with zero downtime, because the new TLSA is already in the DNS zone. Your users are spared the wait while brand new, never-before-seen TLSA records propagate from your master zone to all caching DNS resolvers on the internet. Some caching DNS resolvers begin returning the new TLSA records within seconds, and some take longer than the TTL to answer with the newly propagated TLSA records! So the "reduce TTL" method is unreliable and should not be used to make the internet's caching DNS resolvers expire those old TLSA records sooner. In reality it is too sluggish compared to the Current and Next TLSA records method.
Thanks for the research! It's a pity the option to add the TLSA records in advance is so complex to implement in practice..
Agreed, it's a pity. However, it's because users sometimes need/want to renew their certs with no delay, so there's no time to reduce the TTL on the zone and let the old TTL pass before the zone has propagated to the world. This Current and Next TLSA record approach was brought up and agreed on by the IETF in 2011. Luckily there are two open source libraries which help with these Current and Next TLSA records.
Do you want these steps to reproduce and verify this invalid expired TLSA record?
I understand the process, the tricky part would be re-designing how Virtualmin does SSL cert requests to add a delay between when the cert is generated, and when it is installed. Currently all the cert renewal code does both in a single operation, which isn't easy to split up.
Maybe the only practical work-around is to set a lower TTL on the TLSA records by default?
I guess that might be a workaround until the libraries that do the two keys per TLSA record (current and next) can be added. How long are most website visits... 5min? 10min?
The same notification email from the DANE TLSA survey bot is coming in again!
What would explain it: failed/mismatched/expired TLSA records being served by BIND9 DNS, 1 day after the Let's Encrypt TLS cert renewed.
https://stats.dnssec-tools.org/explore/?virtualserver.tld
Issues found with the virtualserver.tld domain:
The DANE TLSA records of these MX hosts fail to validate their certificate chains. Inbound email may be delayed or not delivered.
mail.virtualserver.tld xxx.xxx.xxx.xxx no TLSA record match
mail.virtualserver.tld 2xxx:xxxx:xxxx:xxxx::xxxx no TLSA record match
mail.virtualserver.tld
The TLSA records fail to validate the certificate chain at one or more IP addresses. Inbound email may be delayed or not delivered.
DANE-EE (3) Cert (0) SHA2-256 (1) cf05ae132fc7c21dc9b2675ed093a862ac65e919d0ec0127ba3496439f724bbe
https://dane.sys4.de/smtp/virtualserver.tld
SMTP (fail icon)
The domain lists the following MX entries:
5 mail.virtualserver.tld
IP Addresses
xxx.xxx.xxx.xxx
2xxx:xxxx:xxxx:xxxx:0:0:0:xxxx
Usable TLSA Records
3, 0, 1 cf05ae132fc7c21d[...]ba3496439f724bbe - certificate not trusted: (27) - certificate not trusted: (27)
Here's a simple workaround to get no downtime on the TLSA records with Let's Encrypt certbot because, as you said @jcameron, you can't really get the "future next" key pair from Let's Encrypt, at least not now.
So the fix is just to use certbot renew --reuse-key? That seems like something we could do pretty easily ...
Yes, because certbot doesn't have a way for you to create a "future key"... the way certbot currently works, any key you create is the current key. 😞
Ok, I'll look into this ...
Ok, the next release of Webmin will re-use existing Let's Encrypt keys by default..
@chris001 Did this get fixed?
@shoulders Close but not 100%.
The domain lists the following MX entries: 5 mail.virtualsubserver.tld DNSSEC ✔️ TLSA ✔️ SMTP ❌ All TLSA RRs failed. (See details.)
IP Addresses aaa.bb.ccc.ddd AAAA:BBB:CCCC:DDD:E:F:G:H
Usable TLSA Records 3, 0, 1 1dc4e0cbfab83ac0[...]cdaf9998b83a9c34 - certificate not trusted: (27) - certificate not trusted: (27)
Looks like this will be fixed by the change suggested here : https://github.com/virtualmin/virtualmin-gpl/issues/803
Note: This bug would probably be fixed, and the DANE Survey bot would rate the TLSA records as correct, if Virtualmin were to configure BIND to do fully automatic DNSSEC record generation, record validation, key management and key rollover with respect to the TTLs of DNSSEC records, as mentioned in the comments on #410
Details web page:
The DANE TLSA records of these MX hosts fail to validate their certificate chains. Inbound email may be delayed or not delivered.
The TLSA records fail to validate the certificate chain at one or more IP addresses. Inbound email may be delayed or not delivered.