nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

Respect TTL in DNS record #21

Closed teohhanhui closed 7 years ago

teohhanhui commented 7 years ago

It'd be great if DNS resolution could respect the TTL specified in the DNS record (as it should), and also not apply the TTL when the lookup resolves to no address.

dridi commented 7 years ago

The abstraction currently used (getaddrinfo) doesn't expose the TTL and for now it's too big a change to completely rewrite the resolution logic.
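To illustrate the limitation, here is a minimal sketch using Python's `socket.getaddrinfo`, which wraps the same libc call. The helper name `resolve` is made up for this example and is not part of the VMOD. Each result carries an address family, socket type, protocol, canonical name and socket address, and nothing that says how long the answer may be cached.

```python
import socket


def resolve(host, port):
    """Return the socket addresses getaddrinfo reports for host:port."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    # There is no TTL field, so the caller cannot tell how long the
    # DNS record allows the answer to be cached.
    return [sockaddr for _family, _type, _proto, _canonname, sockaddr in infos]


if __name__ == "__main__":
    print(resolve("localhost", 80))
```

Getting at the record TTL therefore requires a resolver library that exposes the DNS answer itself rather than only the resulting addresses.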

teohhanhui commented 7 years ago

Is it possible to at least not apply the TTL when resolution failed? But I guess it's not going to help much when the TTL from the DNS record is not used.

dridi commented 7 years ago

Please explain what you mean by not applying the TTL in case of a failure. How should the VMOD behave instead?

teohhanhui commented 7 years ago

If getaddrinfo returned an error (for example, when the backend Docker container is not up yet), why should the TTL be applied and cause all other subsequent requests to fail even when the backend is up?

I think it is kind of similar to hit-for-pass vs hit-for-miss in Varnish, and why hit-for-miss is the default now. :smile:

dridi commented 7 years ago

So how frequently should the lookups occur after a failure? Currently if you have a list of backends after a lookup failure, it will be kept around until the next lookup. I see that having a mechanism to force a lookup would be more helpful to mitigate this situation.

teohhanhui commented 7 years ago

Uhh, I've realized that the lookup is active rather than passive. So for example ttl = 5s actually does the lookup every 5 seconds, regardless of whether there are any requests or not. What would be really helpful is a "passive" mode of DNS resolution, which only uses the TTL to determine whether our cached addresses are still fresh. I imagined it to only happen when .backend is called (e.g. in vcl_recv).
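A rough sketch of what such a "passive" mode could look like, written in Python rather than VCL or C. The class name `PassiveResolver`, the injectable `lookup` function and the `clock` parameter are all hypothetical and only serve the illustration; this is not how the VMOD works.

```python
import time


class PassiveResolver:
    """Resolve lazily: look up only when a caller asks and the cache is stale."""

    def __init__(self, lookup, ttl, clock=time.monotonic):
        self._lookup = lookup      # callable: host -> list of addresses
        self._ttl = ttl            # seconds a cached answer stays fresh
        self._clock = clock        # injectable so the sketch can be tested
        self._cache = {}           # host -> (expires_at, addresses)

    def backend(self, host):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is None or now >= entry[0]:
            # The lookup runs in the caller's thread, so the caller waits
            # for the resolver whenever the cached answer has expired.
            addresses = self._lookup(host)
            entry = (now + self._ttl, addresses)
            self._cache[host] = entry
        return entry[1]
```

With no traffic there are no lookups at all, at the price of the requesting thread waiting on DNS whenever the cached answer has expired.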

teohhanhui commented 7 years ago

> Currently if you have a list of backends after a lookup failure, it will be kept around until the next lookup.

That doesn't help when the first lookup failed and it's "cached" for the entire duration of the TTL. Is there a workaround for this case? But also, I think that kind of fallback to stale addresses is potentially unsafe. It compounds the lookup failure by refusing to retry for the entire duration of the TTL.
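To make the request concrete, here is a minimal sketch of a cache that treats failures differently from successes. Everything in it is hypothetical: the class `FailureAwareCache` and the `retry_after` setting do not exist in the VMOD. A successful lookup is kept for `ttl`, while a failed one only delays the next attempt by the shorter `retry_after`.

```python
class FailureAwareCache:
    """Cache lookups for ttl seconds, but retry sooner after a failure."""

    def __init__(self, host, lookup, ttl, retry_after):
        self._host = host
        self._lookup = lookup            # callable: host -> list of addresses
        self._ttl = ttl                  # delay after a successful lookup
        self._retry_after = retry_after  # shorter delay after a failure
        self._addresses = []
        self._next_lookup = 0

    def refresh(self, now):
        """Run one lookup if it is due; return the addresses currently in use."""
        if now < self._next_lookup:
            return self._addresses
        try:
            self._addresses = self._lookup(self._host)
            self._next_lookup = now + self._ttl
        except OSError:
            # socket.gaierror is an OSError subclass. The previous list
            # (possibly empty) is kept, but the failure is not held for
            # the full ttl.
            self._next_lookup = now + self._retry_after
        return self._addresses
```

Note that this sketch still falls back to the previous address list on failure, which is the behaviour questioned above; dropping the list instead would be a one-line change in the `except` branch.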

dridi commented 7 years ago

> I imagined it to only happen when .backend is called (e.g. in vcl_recv).

That's how the DNS director worked up to Varnish 3, however it had the side effect of blocking worker threads during the lookup. And in case of a failure you would repeatedly block workers with what you are suggesting. In a setup where one Varnish instance is shared by several virtual hosts, a DNS problem on one of the hosts could dramatically suck resources away from all the hosts.

With this VMOD threads may be blocked only during the first lookup of a domain.
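The active approach can be sketched roughly as follows, in Python for brevity. This is an illustration of the idea only, not the VMOD's actual C implementation; the class `ActiveResolver` and its methods are invented names. A background thread repeats the lookup every `ttl` seconds, and callers only wait until the first lookup has finished.

```python
import threading


class ActiveResolver:
    """Refresh addresses in a background thread so callers rarely block."""

    def __init__(self, host, lookup, ttl):
        self._host = host
        self._lookup = lookup            # callable: host -> list of addresses
        self._ttl = ttl                  # seconds between lookups
        self._lock = threading.Lock()
        self._addresses = []
        self._first_done = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                addresses = self._lookup(self._host)
                with self._lock:
                    self._addresses = addresses
            except OSError:
                pass  # keep the previous list until the next lookup
            self._first_done.set()
            self._stop.wait(self._ttl)   # sleep, but wake early on stop()

    def backend(self):
        # Blocks only until the first lookup attempt has completed.
        self._first_done.wait()
        with self._lock:
            return list(self._addresses)

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A slow or failing resolver then delays only the background thread, which is the property that protects workers serving other virtual hosts.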

> That doesn't help when the first lookup failed and it's "cached" for the entire duration of the TTL. Is there a workaround for this case?

I thought of one, that's why I reopened this ticket. However it has yet to be implemented:

> I see that having a mechanism to force a lookup would be more helpful to mitigate this situation.

nigoroll commented 7 years ago

I had also worked on an implementation of caching by DNS TTL and looked at https://www.gnu.org/software/adns/, but did not find the code quality convincing. I did not find any good alternatives either (if you are aware of any, let me know). Developing a resolver from scratch was too big an endeavor within the time my sponsor allowed, so I dropped all plans in this direction for now. But I think there is a viable and working alternative: use nscd. It does exactly what we need - cache DNS lookups from our getaddrinfo for the TTL in the DNS record. With this setup, the TTL configured in vmod_dynamic becomes a lower bound.
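The "lower bound" remark can be checked with a small simulation. It is a deliberately simplified model, assuming a caching layer that holds each answer for exactly the record TTL and a poller that asks at a fixed interval; the function name and its parameters are invented for this sketch and say nothing about nscd's internals.

```python
def upstream_query_times(poll_interval, record_ttl, duration):
    """Times at which a periodic poller actually reaches the DNS server.

    poll_interval: seconds between lookups issued by the poller
    record_ttl:    seconds the caching layer keeps an answer
    duration:      length of the simulated period in seconds
    """
    queries = []
    cache_expires = 0
    t = 0
    while t < duration:
        if t >= cache_expires:
            # Cache miss: the lookup goes upstream and is cached again.
            queries.append(t)
            cache_expires = t + record_ttl
        t += poll_interval
    return queries


if __name__ == "__main__":
    print(upstream_query_times(5, 30, 60))   # polling faster than the TTL
    print(upstream_query_times(30, 5, 60))   # polling slower than the TTL
```

In both runs a fresh answer arrives only every 30 seconds: the record TTL governs when polling is faster, and the polling interval governs when it is slower, so the addresses are never refreshed more often than the configured interval.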

dridi commented 7 years ago

These days I'm told getdns is the new hotness.

https://getdnsapi.net/

nigoroll commented 7 years ago

I've overhauled the documentation and added the recommendation to use nscd. If anyone wants to sponsor more work on this to avoid external caching services, please contact me.

nigoroll commented 5 years ago

This is implemented as of 256b1f01677793f4363318830d7e61ff4de2c0fc when getdns support is compiled in. There is now a ttl_from parameter.
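As a rough illustration of how such a parameter could combine the two TTLs, the sketch below assumes that ttl_from accepts the values `cfg`, `dns`, `min` and `max`. That list is an assumption not stated in this thread and should be checked against the vmod_dynamic documentation; the function `effective_ttl` is made up for the example.

```python
def effective_ttl(cfg_ttl, dns_ttl, ttl_from="cfg"):
    """Pick the TTL to apply, given the configured TTL and the record TTL.

    The accepted ttl_from values are an assumption for this sketch.
    """
    if ttl_from == "cfg":
        return cfg_ttl                 # ignore the record TTL
    if ttl_from == "dns":
        return dns_ttl                 # ignore the configured TTL
    if ttl_from == "min":
        return min(cfg_ttl, dns_ttl)   # whichever expires first
    if ttl_from == "max":
        return max(cfg_ttl, dns_ttl)   # whichever expires last
    raise ValueError(f"unknown ttl_from value: {ttl_from!r}")
```

For example, with a configured TTL of 5 seconds and a record TTL of 30 seconds, `min` would refresh every 5 seconds and `max` every 30.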