spotify / dns-java

DNS wrapper library that provides SRV lookup functionality
Apache License 2.0

`SERVFAIL` when trying to resolve a service address #44

Open caesar-ralf opened 3 years ago

caesar-ralf commented 3 years ago

What happened?

When upgrading com.spotify:dns from version 3.1.5 to 3.2.2, some services started getting SERVFAIL errors even though the service being resolved exists.

What was expected?

As there's no breaking change in the perceived API from com.spotify:dns, we expected the changes to not affect functionality.

How to reproduce

We didn't find a reliable way to reproduce this, and we haven't managed to pin down the cause. It seems related to concurrency, since the problem only shows up intermittently. I am more than glad to show the issue happening in a service.
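For anyone who wants to poke at the concurrency theory, here is a minimal sketch of a probe that calls a lookup function from several threads at once and tallies the outcomes. It is not taken from our services: `ConcurrentLookupProbe`, `probe`, and the stand-in lookup in `main` are hypothetical names for illustration, and the real lookup (for example a lambda wrapping the resolver's `resolve` call seen in the stack trace) would have to be plugged in by hand.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

public class ConcurrentLookupProbe {

    /**
     * Calls lookup.apply(fqdn) callsPerThread times from each of `threads` threads
     * and returns a tally keyed by "OK" or by the message of the thrown exception.
     */
    public static <T> Map<String, Long> probe(Function<String, T> lookup, String fqdn,
                                              int threads, int callsPerThread)
            throws InterruptedException {
        ConcurrentHashMap<String, LongAdder> tally = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers together to maximise overlap
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < callsPerThread; i++) {
                    String outcome;
                    try {
                        lookup.apply(fqdn);
                        outcome = "OK";
                    } catch (RuntimeException e) {
                        outcome = String.valueOf(e.getMessage());
                    }
                    tally.computeIfAbsent(outcome, k -> new LongAdder()).increment();
                }
            });
        }

        start.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);

        Map<String, Long> result = new TreeMap<>();
        tally.forEach((k, v) -> result.put(k, v.sum()));
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in lookup (assumption): fails on every third call so the tally has
        // something to show. Replace it with a call into the resolver under test.
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> fakeLookup = name -> {
            if (calls.incrementAndGet() % 3 == 0) {
                throw new RuntimeException("SERVFAIL");
            }
            return name;
        };
        System.out.println(probe(fakeLookup, "example.test", 4, 25));
    }
}
```

Running the same probe with `threads` set to 1 and then to a larger number, against both library versions, would at least show whether the failure rate moves with concurrency or with the version.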

Context

We need to upgrade dnsjava:dnsjava from version 2.x to 3.x. We checked that com.spotify:dns made this change in version 3.2.0. We tested it in some services and they seemed to work fine, so we decided to roll out the change to all of our users. Some of them then started getting SERVFAIL intermittently; from what we can see, these are the ones using gRPC.

Here is an anonymised stack trace:

Jul 15, 2021 4:29:20 PM io.grpc.internal.ManagedChannelImpl$NameResolverListener handleErrorInSyncContext
WARNING: [Channel<38>: (${PROTOCOL}://${SERVICE})] Failed to resolve name. status=Status{code=UNAVAILABLE, description=null, cause=java.util.concurrent.CompletionException: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL 
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
    at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:683)
    at java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:658)
    at java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2094)
    at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$4(DnsSrvNameResolver.java:160)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.spotify.dns.DnsException: Lookup of '${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}' failed with code: 2 - SERVFAIL 
    at com.spotify.dns.XBillDnsSrvResolver.resolve(XBillDnsSrvResolver.java:60)
    at com.spotify.grpc.DnsSrvNameResolver.lambda$resolver$0(DnsSrvNameResolver.java:162)
    at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:680)
    ... 6 more
}

We tried bumping the version of dnsjava:dnsjava from 3.0.2 to 3.4.0 and the problem seemed to go away, but after the service had been running for around 10 minutes it started again. I am not sure if this was a local problem.

When we run dig srv ${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}, hosts are returned as expected. Changing the versions back to com.spotify:dns:3.1.5 and dnsjava:dnsjava:2.x makes the problem go away.

Java version used during the test:

$ java -version
> openjdk version "11.0.10" 2021-01-19 LTS
> OpenJDK Runtime Environment Corretto-11.0.10.9.1 (build 11.0.10+9-LTS)
> OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (build 11.0.10+9-LTS, mixed mode)

mattnworb commented 3 years ago

I wonder if there is a bit of confirmation bias in noticing the SERVFAIL responses with the new library. There may have been sporadic or intermittent SERVFAIL responses with the old library as well, but no one was paying attention until the library was upgraded (and the owners were told to keep an eye out for any weirdness).