internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl
173 stars 37 forks source link

Dane validity failed - use validating resolver for DANE existence subtest #1358

Open je3f0o opened 6 months ago

je3f0o commented 6 months ago

Hello,

My domain's DANE validity has failed for both the website and email servers. In previous years, my domain consistently scored 100 and was even featured on the Hall of Fame page. However, this year, something may have changed, and I may have overlooked it or I forget something I changed. Unfortunately, my DANE validation continues to fail, and I'm unsure of the issue. While other tools shows me positive results, I'm perplexed by what's wrong with my domain. Below is the test result: https://internet.nl/mail/teachmecoding.club/1194082/

Can you help point me in the right direction? I'm using Postfix BTW...

$ postconf mail_version
mail_version = 3.3.0
bwbroersma commented 6 months ago

Interesting, I also checked with https://www.huque.com/bin/danecheck-smtp and it fails on the TLSA DNS record:

GetTLSA: bad response code to TLSA query _25._tcp.teachmecoding.club.: SERVFAIL

I also had some issues (no response on $ dig +short _25._tcp.teachmecoding.club TLSA), and I see you already are on it, https://dnsviz.net/d/_25._tcp.teachmecoding.club/dnssec/: _tcp.teachmecoding.club is a NXDOMAIN, while _25._tcp.teachmecoding.club exists, that is :x: in terms of DNS.

The DANE existence is done with a direct query, while the DANE validity check is done with ldns, probably this one performs a different DNS check, resulting in this correct output, but indeed tricky to understand correctly. https://github.com/internetstandards/Internet.nl/blob/a469e4c151c1740d3f69e36235bc854b0099004a/checks/tasks/tls.py#L1196-L1210

So this is the test that done:

$ openssl s_client -showcerts -crlf -starttls smtp -connect teachmecoding.club:25 -verify_quiet </dev/null 2>/dev/null \
| ldns-dane -c /dev/stdin -n -T -r 9.9.9.9 verify teachmecoding.club 25
dane_query: The queried resource records were bogus

I'm not sure why, but I do sometimes see it succeeds (also on https://www.huque.com/bin/danecheck-smtp):

$ openssl s_client -showcerts -crlf -starttls smtp -connect teachmecoding.club:25 -verify_quiet </dev/null 2>/dev/null \
| ldns-dane -c /dev/stdin -n -T -r 9.9.9.9 verify teachmecoding.club 25
CN=teachmecoding.club dane-validated successfully

So it would help a lot if the output of ldns is included in the failure of DANE here. Note to first solve the DNS issue, and then maybe wait 300 seconds (this is your negative response caching TTL) because #1340 is not yet deployed.

je3f0o commented 6 months ago

Ok, thank you.

I will try to fix DNS issue by contacting to NS authorities. By the way I don't trust the tool https://www.huque.com/bin/danecheck-smtp. Because this site gives me random results. First time it always fail, then i try again DANE Authentication Successful. Basically 50/50 results.

bwbroersma commented 6 months ago

I'm seeing the same mixed results on my command line when running ldns-dane with 9.9.9.9, so it might be due to the resolver (and e.g. some caching).

baknu commented 6 months ago

The failure probably has to do with "qname minimization"(RFC 7816) that the Internet.nl-resolver for the "DANE validity" subtest does. In addition see RFC 8020 which states:

This document states clearly that when a DNS resolver receives a response with a response code of NXDOMAIN, it means that the domain name which is thus denied AND ALL THE NAMES UNDER IT do not exist.

The nameservers of teachmecoding.club do not handle "qname minimization" correctly. See DNSviz and the relevant below test result quote. This seems to be the reason why the subtest fails.

A query for _25._tcp.teachmecoding.club results in a NOERROR response, while a query for its ancestor, _tcp.teachmecoding.club, returns a name error (NXDOMAIN), which indicates that subdomains of _tcp.teachmecoding.club, including _25._tcp.teachmecoding.club, don't exist.

Besides, this case also makes clear that something seems to have changed (unintentionally) in the Internet.nl code base. Previously, such a domain namely would already fail on the "DANE existence" subtest. So, apparently the Internet.nl Unbound (that is used for the "DANE existence" subtest) does not do "qnmame minimization" anymore, while ldns (that is used for the "DANE validity" subtest) still does. For comparison:

I suggest to:

  1. Enable qname mimimization again on Unbound
  2. Make this kind of failure more transpartant/explicit for users which was already proposed in https://github.com/internetstandards/Internet.nl/issues/534

Btw not sure why Internet.nl uses ldns. Alternatively, Internet.nl could probably also query DANE TLSA records with Unbound and validate these records with OpenSSL. See: https://www.openssl.org/docs/manmaster/man3/SSL_CTX_dane_enable.html

bwbroersma commented 6 months ago

I agree to fix #534. The current unbound configuration for the 2 resolvers (resolver-permissive.conf.template and resolver-validating.conf.template) don't explicitly set qname-minimisation: yes, but it should default to yes, also the ldns uses the same (validating) resolver.

baknu commented 6 months ago

@gthess Can you please check what is happening here? Thanks!

gthess commented 6 months ago

I believe this is because of the following change: https://github.com/internetstandards/Internet.nl/blob/a35628fa058c8fbb9c0a267f6eb07bce7dabf60b/checks/tasks/__init__.py#L30 The Unbound context here used to be a validating resolver. When in validation mode, Unbound would end qname-minimization when a signed NXDOMAIN is encountered during resolution. This is the case with this domain where the end result is BOGUS and previous versions error out on DANE existence.

After the above change, queries are forwarded to a non-validating Unbound which upon encountering the NXDOMAIN, skips qname-minimization as best effort in order to try and resolve the query; which it does.

ldns-dane then is configured to use the validating resolver which provides the BOGUS answer.

baknu commented 6 months ago

Thanks @gthess

Interesting that batch is showing the same behaviour as single test. See: https://batch.internet.nl/mail/teachmecoding.club/7964015/#control-panel-28

I expected this change to be part of the Docker release, and thus batch to not show this behaviour (as it currently does not run on Docker). But maybe this change was already present in or backported/copied to the non-Docker. @mxsasha Could you check?

bwbroersma commented 6 months ago

Pre-docker there was: https://github.com/internetstandards/Internet.nl/blob/c5ba541ac143422a2a2c7f7623cd88c6370f8e63/checks/tasks/__init__.py#L42-L43

Defined by: https://github.com/internetstandards/Internet.nl/blob/2d03c943c22a482e410d2d1eb92c57e3ab784daa/internetnl/settings-dist.py#L114 https://github.com/internetstandards/Internet.nl/blob/2d03c943c22a482e410d2d1eb92c57e3ab784daa/internetnl/settings-dist.py#L389-L394

The CENTRAL_UNBOUND was dropped / simplified to IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE:

Since batch currently has the same behavior as docker, probably the unbound conf was changed prior to the dockerization, and the dockerization (this change) itself is correct.

The thing to check is the (central) unbound config on batch, and when it was last changed.

From the documentation it reads the new IPV4_IP_RESOLVER_INTERNAL_VALIDATING it only used for ldns-dane (see code in comment above), and that seems to be the case indeed: https://github.com/internetstandards/Internet.nl/blob/c025ade5aaad177322690631e5e1bfa965daff2f/documentation/Docker-DNS.md#L5-L6

mxsasha commented 6 months ago

Following from discussion with @bwbroersma: the DANE checks should use the validating resolver, and not report bogus records, as they do not "count" for DANE. More optimal: report any bogus response from any check in the DNSSEC result, but this is a big overhaul.

bwbroersma commented 6 months ago

Part of the larger overhaul is in:

Agreed to only fix the validating resolver for the DANE existence subtest for now. Only question is, can the resolver change be done as backport (for v1.8.6) or should it be v1.9?

mxsasha commented 6 months ago

I have looked into the code, and all the resolving is shared, and the DANE lookup may be performed both as part of do_mail_get_servers() or by the dane check itself. I'm wondering if we should not use the non-validating resolver anyways. I know this was discussed, but I'm not sure about the reasoning to set it up this way. However, I would prefer to pick this up properly as part of #1378, so that we can have one clear way to interact with DNS.

baknu commented 6 months ago

I'm wondering if we should not use the non-validating resolver anyways. I know this was discussed, but I'm not sure about the reasoning to set it up this way.

I believe the reasoning was that even if a domain is DNSSEC bogus, Internet.nl can still run all the other subtests that need DNS lookups. See for example: https://internet.nl/site/servfail.nl/2719028/ @gthess: Do you remember more about the reasoning?

Maybe a bit comparable: note that with RPKI invalids Internet.nl is not able to run other subtests (that require connectivity) because our hosting provider is filtering. See for example: https://internet.nl/site/invalid.rpki.isbgpsafeyet.com/2719039/.

Since DNSSEC bogus is an exception that should be resolved first before checking other test results, it seems indeed a good idea to simplify our setup and use only a validating resolver,

gthess commented 6 months ago

@gthess: Do you remember more about the reasoning?

The non-validating Unbound context was there to give answers skipping DNSSEC validation altogether. This was only for the prechecks IIRC. A validating Unbound context does validation and returns both the data and the secure status, which in some tests like the mail auth test the secure bit is not considered just the data IIRC.

If all the contexts will be replaced by a stub, the stub needs to make sure to activate the CD flag (Checking Disabled) on its queries so that it gets also the bogus data back for other tests to work (like the prechecks and the mail auth test). The AD flag on the response would indicate if DNSSEC validation succeeded or failed. (I'll add this last comment also to the stub replacing issue.)