internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl
164 stars 36 forks source link

Domain validation issue specific to batch site test form POST validation #1376

Closed mxsasha closed 3 months ago

mxsasha commented 3 months ago

On the web interface of dev-docker.batch, there is some strange behaviour: when trying to start a site test through the field/button on the home page, all domains are considered invalid. There are no issues for mail. This isn't the intended use of the batch site, but it is suspicious whether the root cause has wider effects.

The home page does a form POST to /site or /mail, with a parameter url set to the domain (domain.index view). This returns a redirect to /test-site/?invalid for web if it's invalid, otherwise to /site/<domain> (domain.validate_domain view). Incorrectly, this view always redirects to /test-site/?invalid. The /site/<domain> URL can be called directly, in which case validation does succeed. This is odd, because the /site code will actually call the /site/<domain> code to determine validity. In the context of the POST to /site, it is however invalid. The same relation exists for mail.index and mail.validate_domain but the issue does not occur for mail.

Weirder, calling the same code in the docker container on the exact same view does work:

>>> import internetnl; import django; django.setup()
Batch enabled.
>>> from interface.views import domain
>>> domain.validate_domain(None, "internet.nl")
<SafeHttpResponseRedirect status_code=302, "text/html; charset=utf-8", url="/site/internet.nl/">
>>> domain.validate_domain(None, "internet.nll")
<HttpResponseRedirect status_code=302, "text/html; charset=utf-8", url="/test-site/?invalid">
>>> request = type('obj', (object,), {'POST' : {}})
>>> request.POST["url"] = "internet.nl"
>>> domain.index(request)
<SafeHttpResponseRedirect status_code=302, "text/html; charset=utf-8", url="/site/internet.nl/">

Unbound logs an error when the validation incorrectly fails: [1712058274] libunbound[1:0] error: failed to deserialize newq. This only occurs in the fail scenario, in every other, there are normal unbound logs. Therefore, this is probably the root cause of the incorrect validation result. The unbound code does not reveal much. @gthess perhaps you have more hints?

TLDR: the app fails to run an unbound query due to that unbound error, but only when that specific view is called through domain.index view, not reproducible from REPL, and while succeeding for identical code run with an identical context/settings - as the actual unbound call is a few layers deeper.

Due to the weird set of circumstances, and so far no identified issues in testing docker batch, it is likely that this issue does not affect regular batch measurements. It also has not been seen on single test.

bwbroersma commented 3 months ago

In my view it's not a XHR but a regular html form post:

$ curl 'https://docker.batch.internet.nl/site/' -d 'url=example.nl' -isSfA ''
mxsasha commented 3 months ago

In my view it's not a XHR but a regular html form post:

You're right, edited.

mxsasha commented 3 months ago

I have just now also seen the error: failed to deserialize newq unbound message on a successful mail request, so it may be a red herring.

In my firefox console: does not work:

await fetch("https://dev-docker.batch.internet.nl/site/", {
    "headers": {
        "Content-Type": "application/x-www-form-urlencoded",
        "Authorization": "Basic <removed>",
    },
    "body": "url=ecp.nl",
    "method": "POST",
});

does work:

await fetch("https://dev-docker.batch.internet.nl/mail/", {
    "headers": {
        "Content-Type": "application/x-www-form-urlencoded",
        "Authorization": "Basic <removed>",
    },
    "body": "url=ecp.nl",
    "method": "POST",
});
mxsasha commented 3 months ago

Narrowed down by trying to break it on dev-docker: this is caused by ENABLE_BATCH, specifically because it triggers this import in app setup: https://github.com/internetstandards/Internet.nl/blob/73077d20f7c06058ad1b478be160f7ddad070507/interface/apps.py#L63

Apparently, that has a side effect on... unbound resolving but only through certain paths?

mxsasha commented 3 months ago

From the APIMetadata import the path leads to https://github.com/internetstandards/Internet.nl/blob/73077d20f7c06058ad1b478be160f7ddad070507/interface/batch/util.py#L34 and to https://github.com/internetstandards/Internet.nl/blob/73077d20f7c06058ad1b478be160f7ddad070507/checks/probes.py#L34-L50 Any of the probe imports seem to trigger the issue. IPv6 imports from interface.views.shared again, but not all of them do. They do all use SetupUnboundContext though.

interface.views.shared has a global ub_ctx: https://github.com/internetstandards/Internet.nl/blob/73077d20f7c06058ad1b478be160f7ddad070507/interface/views/shared.py#L31

Conclusions:

Q: why do we have this close integration with libunbound at all? Don't we just need a basic stub resolver to run DNS queries against the permissive or validating resolver? And leave config details to the resolver's own configs? This may have only become a good option with Docker, because we now have well controlled and versioned unbound configs.

bwbroersma commented 3 months ago

👍 Interesting research. About setting the unbound context versus using unboundconf, I had the same question in the TTL issue:

So it's officially a (reproducible) bug now?

gthess commented 3 months ago

Q: why do we have this close integration with libunbound at all? Don't we just need a basic stub resolver to run DNS queries against the permissive or validating resolver? And leave config details to the resolver's own configs? This may have only become a good option with Docker, because we now have well controlled and versioned unbound configs.

Pyunbound was used because it provided a validating resolver baked into internet.nl. Now with the docker setup I believe it can indeed be replaced by a stub library that can evaluate the answers from a trusted source, in this case the locally deployed unbound.

The only thing that pyunbound offers at the moment IIRC is extra information when a DNSSEC algorithm is not supported and it gives a different result to the DNSSEC validity test. But nowadays with Extended DNS Errors (EDE) Unbound will attach EDE 2 for the same case. So Unbound configured to use EDE and a stub that can read EDEs, which are EDNS options, would handle the same case and require one less patch to the currently used pyunbound. I don't have any experience with Python stub resolvers so I can't recommend anything in particular.

Btw, if pyunbound is replaced with a stub, activating the pre-checks ala single test on the UI can be reconsidered. IIRC I shut them down because of performance reasons while resolving all the batch domains before starting the test (this may have had to do something with pyunbound/celery but didn't investigate further). Also batch users are supposed to use a somewhat curated list of domains so turning off the pre-checks could pass as an argument.

mxsasha commented 3 months ago

So it's officially a (reproducible) bug now?

Yes. Though there may be other circumstances involved, that are equal between dev-docker and dev-bocker.batch. I have a fix in mind. In general I really do not like code that has these large side effects on import - importing a Python module should generally not cause DNS queries.

mxsasha commented 3 months ago

Confirmed fixed by ee94b3d on dev-docker.batch 🎉