Open caesar-ralf opened 3 years ago
I wonder if there is a bit of confirmation bias in noticing the SERVFAIL responses with the new library: there may have been sporadic or intermittent SERVFAIL responses with the old library as well, but no one was paying attention until the library was upgraded (and the owners were told to keep an eye out for any weirdness).
What happened?
When upgrading com.spotify:dns from version 3.1.5 to 3.2.2, some of the services started getting SERVFAIL responses even though the service is there.
What was expected?
As there is no apparent breaking change in the API of com.spotify:dns, we expected the upgrade not to affect functionality.
How to reproduce
We didn't find a good way to reproduce this, and we didn't manage to pin down what is causing the problem. It seems related to concurrency, as the problem sometimes doesn't appear. I am more than glad to show the issue happening in a service.
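Since the failure looks intermittent and possibly concurrency-related, one way to probe it might be to hammer the lookup from several threads and count how many calls fail. The following is only a sketch under assumptions: `LookupStress`, `Result`, and `run` are hypothetical names, the lookup is passed in as a generic `Callable` rather than tied to any particular resolver, and it assumes a failed lookup surfaces as an exception.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness: runs a lookup concurrently and counts failures.
class LookupStress {

    /** Number of lookups attempted and how many of them threw. */
    static final class Result {
        final int attempts;
        final int failures;

        Result(int attempts, int failures) {
            this.attempts = attempts;
            this.failures = failures;
        }
    }

    /**
     * Calls {@code lookup} {@code callsPerThread} times on each of
     * {@code threads} worker threads. Any exception thrown by the lookup
     * is counted as a failure (assumption: SERVFAIL surfaces as an exception).
     */
    static Result run(Callable<?> lookup, int threads, int callsPerThread) {
        AtomicInteger attempts = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    attempts.incrementAndGet();
                    try {
                        lookup.call();
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }

        // Stop accepting work and wait for the submitted tasks to finish.
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Result(attempts.get(), failures.get());
    }
}
```

Running the same harness against both library versions, with the real resolver call passed in as the `Callable`, would give a failure count per version to compare. That could also help tell a real regression from failures that were already happening unnoticed.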
Context
We need to upgrade dnsjava:dnsjava from version 2.x to 3.x. We checked that com.spotify:dns made this change in version 3.2.0. We tested it in some services and they seemed to be working fine, so we decided to roll out the change to all of our users. What happened is that some of them, as far as we can see the ones using gRPC, started getting SERVFAIL intermittently.
Here is an anonymised stack trace:
We tried bumping the version of dnsjava:dnsjava from 3.0.2 to 3.4.0 and the problem seemed to go away, but after around 10 minutes of the service running it started again. I am not sure if this was a local problem.
When we did a
dig srv ${PREFIX}-${SERVICE}._grpc.services.${DOMAIN_ADDRESS}
some hosts were returned as expected. Changing the versions back to com.spotify:dns:3.1.5 and dnsjava:dnsjava:2.x makes the problem go away.
Java version used during the test: