spring-cloud / spring-cloud-gateway

An API Gateway built on Spring Framework and Spring Boot providing routing and more.
http://cloud.spring.io
Apache License 2.0
4.53k stars 3.32k forks

Provide a way to customize via Spring framework a TTL for Netty's DNS cache #3517

Open dimzul opened 2 months ago

dimzul commented 2 months ago

Problem

In a k8s environment, multiple instances of the same service are hidden behind a k8s Service name (for example, my-test.my-namespace.svc.cluster.local). The same goes for DNS servers in k8s: multiple instances are hidden behind a k8s Service. When one DNS server instance dies and comes back on a new k8s node with a different IP address, the old, cached IP address is still used for DNS resolution, because Netty (a transitive dependency of project-reactor) caches the IP addresses of DNS servers for Integer.MAX_VALUE seconds by default via DnsNameResolverBuilder and DefaultAuthoritativeDnsServerCache. This results in requests to an IP address with no DNS server listening and causes the following error:

500 Server Error for HTTP GET "/my-test"
io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'my-test' [A(1)] and search domain query for configured domains failed as well: [production.svc.cluster.local, svc.cluster.local, cluster.local]
    at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1151)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
    *__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
    *__checkpoint ⇢ HTTP GET "/my-test" [ExceptionHandlingWebHandler]
Original Stack Trace:
        at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1151)
        at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1098)
        at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:457)
        at io.netty.resolver.dns.DnsResolveContext.access$700(DnsResolveContext.java:69)
        at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:526)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
        at io.netty.resolver.dns.DnsQueryContext.finishFailure(DnsQueryContext.java:380)
        at io.netty.resolver.dns.DnsQueryContext$5.run(DnsQueryContext.java:315)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [15365: /12.34.5.67:53] DefaultDnsQuestion(my-test.production.svc.cluster.local. IN A) query '15365' via UDP timed out after 5000 milliseconds (no stack trace available)

Steps to reproduce

Following suggestions by @violetagg and @spencergibb on customizing DNS cache TTL and TcpClient in Spring Cloud Gateway, the following configuration was made:

@Component
@Configuration(proxyBeanMethods = false)
public class DnsCacheCustomizer implements HttpClientCustomizer {

    private static final int CACHE_TTL = 5;

    @Bean
    ClientHttpConnector clientHttpConnector(ReactorResourceFactory resourceFactory) {
        TcpClient tcpClient = TcpClient.create(resourceFactory.getConnectionProvider())
                .resolver(nameResolverSpec -> nameResolverSpec.cacheMaxTimeToLive(Duration.ofSeconds(CACHE_TTL)));
        return new ReactorClientHttpConnector(HttpClient.from(tcpClient));
    }

    @Override
    public HttpClient customize(HttpClient httpClient) {
        DnsNameResolverBuilder dnsResolverBuilder = new DnsNameResolverBuilder()
                .channelFactory(EpollDatagramChannel::new)
                .resolveCache(new DefaultDnsCache(0, CACHE_TTL, 0));
        httpClient
                .resolver(nameResolverSpec -> nameResolverSpec.cacheMaxTimeToLive(Duration.ofSeconds(CACHE_TTL)))
                .tcpConfiguration(tcpClient -> tcpClient.resolver(new DnsAddressResolverGroup(dnsResolverBuilder)));
        return httpClient;
    }
}

With this configuration, multiple instances of DnsNameResolverBuilder were created: 2 with the configured cache TTL and 2 with the default cache TTL:

(Screenshots: cache_as_configured_1, cache_as_configured_2, cache_as_default_1, cache_as_default_2)

But when an actual request comes in, the DnsNameResolverBuilder with the default cache TTL configuration is used, and a DNS cache with the default TTL (2147483647 seconds) is applied:

(Screenshot: actual request)
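For scale, the default TTL quoted above can be converted to days with the JDK's java.time.Duration. This is only an arithmetic illustration; the class name DefaultTtlScale is made up for this sketch and is not part of Netty or Spring.

```java
import java.time.Duration;

// Hypothetical helper, not part of Netty or Spring: converts the default TTL to whole days.
final class DefaultTtlScale {

    // Integer.MAX_VALUE seconds is the default TTL reported in this issue.
    static long defaultTtlInDays() {
        return Duration.ofSeconds(Integer.MAX_VALUE).toDays();
    }

    public static void main(String[] args) {
        // Prints 24855, which is roughly 68 years.
        System.out.println(defaultTtlInDays());
    }
}
```

In practice, an entry cached with this TTL is never refreshed during the lifetime of a gateway process.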

Expected result

There is a way to configure DNS cache TTL via Spring Framework.

Versions

spring boot/spring-cloud-starter-gateway/spring-boot-starter-webflux: 3.2.8
reactor-netty-http: 1.1.21
netty: 4.1.111.Final

bindupatnaik commented 1 month ago

@spring-cloud-issues, is there any update on this issue? I am facing the same problem described here and am looking for help. I also commented on the open issue https://github.com/spring-cloud/spring-cloud-gateway/issues/561. Please provide an update on when this issue will be fixed. I tried all the workarounds mentioned, with no luck.

bindupatnaik commented 1 month ago

@dimzul did you find any workarounds for this problem? I am happy to connect with you to discuss further.

dimzul commented 1 month ago

@bindupatnaik, unfortunately, no: none of the provided solutions has any effect on the DNS cache TTL in Netty. I've debugged it locally and tested it in a real cluster, and got the same result both times, with the default TTL applied. Switching to the JVM built-in resolver also had no effect:

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

If you find a solution, please share it here.

violetagg commented 1 month ago

@dimzul

This configuration is not quite correct. Use either HttpClient#resolver or HttpClient#tcpConfiguration, but never both. I would recommend HttpClient#resolver: HttpClient#tcpConfiguration is deprecated, and everything you can configure there can be configured by calling HttpClient directly.

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

DefaultAddressResolverGroup.INSTANCE is the JDK's built-in domain name lookup mechanism, so you need to use the JDK configuration for the TTL.
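A minimal sketch of the JDK-side configuration mentioned above, assuming the gateway really is resolving names through DefaultAddressResolverGroup.INSTANCE. The JDK reads the networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties; the class name JdkDnsCacheTtl and the 5-second value are illustrative only. The JDK reads its cache policy once, so the properties should be set before the first name lookup.

```java
import java.security.Security;

// Hypothetical helper: sets the JDK's own DNS cache TTLs (this does not touch Netty's cache).
final class JdkDnsCacheTtl {

    static void apply(int ttlSeconds) {
        // TTL for successful lookups, in seconds.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(ttlSeconds));
        // TTL for failed lookups, in seconds; 0 means failures are never cached.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        // Call this before any InetAddress lookup, e.g. before SpringApplication.run(...).
        apply(5);
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

The same properties can also be set in the JDK's java.security file instead of in code.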

I also do not recommend using HttpClient#from, which is deprecated as well.
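The advice above could be sketched as follows, assuming Netty's DNS resolver is kept and only HttpClient#resolver is used. The class name DnsCacheTtlCustomizer and the 5-second TTL are illustrative. Nobody in this thread has confirmed that this changes the TTL of the cached DNS server addresses described in the report, so treat it as a starting point rather than a verified fix.

```java
import java.time.Duration;

import org.springframework.cloud.gateway.config.HttpClientCustomizer;
import org.springframework.stereotype.Component;

import reactor.netty.http.client.HttpClient;

// Illustrative customizer: configures the resolver only through HttpClient#resolver.
@Component
public class DnsCacheTtlCustomizer implements HttpClientCustomizer {

    // Illustrative value; pick a TTL that suits the cluster.
    private static final Duration CACHE_TTL = Duration.ofSeconds(5);

    @Override
    public HttpClient customize(HttpClient httpClient) {
        // HttpClient is immutable, so the configured instance returned by resolver(...) must be returned.
        return httpClient.resolver(spec -> spec.cacheMaxTimeToLive(CACHE_TTL));
    }
}
```

No HttpClient#tcpConfiguration or HttpClient#from call is involved here, in line with the deprecation notes above.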

ParkerM commented 1 month ago

    @Override
    public HttpClient customize(HttpClient httpClient) {
        httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
        return httpClient;
    }

Note that the fluent config methods in reactor-netty's HttpClient don't modify the instance -- they configure and return a duplicated instance. This has bitten me before, and was ultimately solved by reassigning each call or returning the entire chain. Try this:

    @Override
    public HttpClient customize(HttpClient httpClient) {
        return httpClient
                .resolver(DefaultAddressResolverGroup.INSTANCE)
                .tcpConfiguration(tcpClient -> tcpClient.resolver(DefaultAddressResolverGroup.INSTANCE));
    }
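The difference between the two versions can be shown without reactor-netty. The sketch below uses a made-up ImmutableClient class as a stand-in for any immutable fluent type; it is not reactor-netty's HttpClient, and the resolver names are placeholder strings.

```java
// Hypothetical stand-in for an immutable fluent client; not reactor-netty's HttpClient.
final class ImmutableClient {

    private final String resolver;

    ImmutableClient(String resolver) {
        this.resolver = resolver;
    }

    // Fluent config method on an immutable type: returns a copy and leaves this instance untouched.
    ImmutableClient resolver(String newResolver) {
        return new ImmutableClient(newResolver);
    }

    String resolver() {
        return resolver;
    }

    // Mirrors the pattern that discards the configured copy.
    static ImmutableClient customizeDiscarding(ImmutableClient client) {
        client.resolver("custom"); // the configured copy is thrown away
        return client;
    }

    // Mirrors the pattern that returns the configured copy.
    static ImmutableClient customizeReturning(ImmutableClient client) {
        return client.resolver("custom");
    }

    public static void main(String[] args) {
        ImmutableClient original = new ImmutableClient("default");
        System.out.println(customizeDiscarding(original).resolver()); // prints "default"
        System.out.println(customizeReturning(original).resolver());  // prints "custom"
    }
}
```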