opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Unbound: allow discard-timeout to be configured #7493

Open planetf1 opened 4 weeks ago

planetf1 commented 4 weeks ago

Is your feature request related to a problem? Please describe.

In Unbound DNS, if RFC 8767 (serve-expired) support is configured with the default client timeout (1800 ms), replies that take longer than 1900 ms (the discard-timeout default) do not update the cache. This severely limits the usefulness of RFC 8767, which can be invaluable when dealing with slow resolvers.
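The interaction of the two timeouts can be sketched as a toy model. This only mirrors the behaviour described in this report; it is not Unbound's code, and the function name and return strings are made up for illustration. It assumes an expired entry for the name is already in the cache.

```python
def classify_reply(latency_ms, client_timeout_ms=1800, discard_timeout_ms=1900):
    """Toy model of the behaviour described in this report (not Unbound's code).

    Assumes an expired cache entry exists for the queried name.
    A discard timeout of 0 means the discard check is disabled.
    """
    if discard_timeout_ms != 0 and latency_ms > discard_timeout_ms:
        # The upstream reply arrives after discard-timeout: it is dropped.
        return "stale answer served, late reply dropped, cache not refreshed"
    if latency_ms > client_timeout_ms:
        # The client already got the expired answer, but the reply is still used.
        return "stale answer served, cache refreshed by late reply"
    return "fresh answer served, cache refreshed"


if __name__ == "__main__":
    for latency in (500, 1850, 2500):
        print(latency, "->", classify_reply(latency))
    # Same slow reply with a larger discard-timeout:
    print(2500, "->", classify_reply(2500, discard_timeout_ms=3000))
```

With the defaults, only replies that land in the narrow 1800 to 1900 ms window both arrive after the stale answer and still refresh the cache, which is why a larger discard-timeout matters for slow upstreams.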

Describe the solution you'd like

I would like to be able to configure 'discard-timeout' in the UI (Advanced settings), probably within 'serve expired settings'.

The help text should guide the user to set this value higher than the serve-expired client timeout, and warn that otherwise late responses will not update the cache.
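The kind of check the help text would describe could look like the sketch below. It is only an illustration: the function name and message are hypothetical, and it says nothing about how OPNsense validates settings in practice.

```python
from typing import Optional


def discard_timeout_warning(client_timeout_ms: int,
                            discard_timeout_ms: int) -> Optional[str]:
    """Return a warning if discard-timeout is not above the client timeout.

    Hypothetical helper for illustration only. Per the Unbound docs,
    a discard-timeout of 0 disables the discard behaviour.
    """
    if discard_timeout_ms == 0:
        return None  # discard disabled, nothing to warn about
    if discard_timeout_ms <= client_timeout_ms:
        return (
            f"discard-timeout ({discard_timeout_ms} ms) should be larger than "
            f"serve-expired-client-timeout ({client_timeout_ms} ms); "
            "otherwise late responses will not update the cache"
        )
    return None
```

For example, `discard_timeout_warning(1800, 1500)` returns a warning string, while `discard_timeout_warning(1800, 3000)` returns `None`.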

Describe alternatives you considered

An alternative would be to allow custom settings. However, this was previously removed, and it does not help or guide a user who is using serve-expired towards a valid configuration.

I tried manually editing some of the Unbound configuration, but these changes of course get overwritten on service start.

Additional context

See initial discussion at https://forum.opnsense.org/index.php?topic=40738.0

Note the recommendation that the value be a bit larger than serve-expired-client-timeout in the official docs at https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html

discard-timeout: The wait time in msec where recursion requests are dropped. This is to stop a large number of replies from accumulating. They receive no reply, the work item continues to recurse. It is nice to be a bit larger than serve-expired-client-timeout if that is enabled. A value of 1900 msec is suggested. The value 0 disables it.

Default: 1900

I am planning to experiment with a value around 3000 (since there is a related timeout for all TCP queries against auth servers that defaults to 3000).

The biggest cause of slow replies in my case was 'far away' servers. I am using Quad9 DoT, but some Chinese servers in particular can on occasion take 2, 3 or 4 seconds to reply. serve-expired can work well here, but I want to catch the responses that finally come back so that I keep the cache as fresh as possible.
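For reference, the resulting settings would look roughly like the fragment rendered below. The option names (`serve-expired`, `serve-expired-client-timeout`, `discard-timeout`) are from the Unbound docs; the helper function is hypothetical, and OPNsense generates its Unbound configuration through its own templates, so this is only a sketch of the intended end result.

```python
def render_unbound_fragment(client_timeout_ms: int = 1800,
                            discard_timeout_ms: int = 3000) -> str:
    """Render an illustrative unbound.conf fragment (hypothetical helper)."""
    lines = [
        "server:",
        "    serve-expired: yes",
        f"    serve-expired-client-timeout: {client_timeout_ms}",
        f"    discard-timeout: {discard_timeout_ms}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_unbound_fragment())
```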

Note that this particular code path results in the 'request queue exceeded' counter increasing. That is a little misleading, as this is really a 'requests dropped' counter. However, that's a base Unbound issue. This was how I originally noticed the problem: I questioned why the counter was increasing when I was nowhere near the queue size limits.

Finally, if it helps, I could consider trying to build a PR. I'm an experienced dev and have worked with open source, but never with OPNsense.

fichtner commented 3 weeks ago

It should be quite easy to add. If you want to take a stab at it, follow 387fc592d7cbd68

planetf1 commented 3 weeks ago

It should be quite easy to add. If you want to take a stab at it, follow 387fc59

I'll give it a go (may be a few days). thx

planetf1 commented 3 weeks ago

Thanks for the tip. I got past some build glitches (git) and I think the PR is appropriate.

fichtner commented 2 weeks ago

Left small review notes, but I'm confident this will land soon :)