quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.84k stars 2.7k forks source link

No server chosen by com.mongodb.reactivestreams: Failed looking up SRV record #37618

Closed charalampos-micholias closed 2 months ago

charalampos-micholias commented 11 months ago

Describe the bug

Issue using Quarkus 2.15.0 version

When trying a connection attempt in a mongo shard through a Quarkus application version >= 2.15.0 below error is propagated:

_2023-11-24 10:07:53,815 INFO  [com.vod.dxl.gr.sys.uti.SRVQuery] (vert.x-eventloop-thread-0) 0 0 27017 dxl-gr-preprod-shard-00-02-nuler.mongodb.net.
2023-11-24 10:07:53,907 INFO  [com.vod.dxl.gr.sys.uti.SRVQuery] (vert.x-eventloop-thread-0) 0 0 27017 dxl-gr-preprod-shard-00-00-nuler.mongodb.net.
2023-11-24 10:07:53,908 INFO  [com.vod.dxl.gr.sys.uti.SRVQuery] (vert.x-eventloop-thread-0) 0 0 27017 dxl-gr-preprod-shard-00-01-nuler.mongodb.net.
2023-11-24 10:07:56,108 INFO  [org.mon.dri.cluster] (executor-thread-1) No server chosen by com.mongodb.reactivestreams.client.internal.ClientSessionHelper$$Lambda$441/0x0000000050de1ef0@11222f0c from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out
2023-11-24 10:07:59,508 INFO  [org.mon.dri.cluster] (cluster-656075d114c7f278b4f504d7) No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out_

Issue using Quarkus 3.1.0 version

As described on #32991 this issue is resolved in "Quarkus 3.0.2.Final" version

Based on the aforementioned proposal, upon upgrading on Quarkus "3.1.0.Final" a new error (resulting to same mongo connection inability) takes place.

Below you may find exact logs:

_07:24:02 INFO  traceId=afd520fa41b8e64ab71320ef860625cb, parentId=, spanId=914390a5279baf32, sampled=true [or.mo.dr.cluster] (vert.x-eventloop-thread-0) No server chosen by com.mongodb.reactivestreams.client.internal.ClientSessionHelper$$Lambda$771/0x0000000064382208@64e7bf16 from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused}}]}. Waiting for 30000 ms before timing out

Dedicated Script for further logs: In attempt to have further view on Mongo SRV attributes we have created a dedicated script - triggered each time mongo data retrieval- is attempted that provides below logs:

_07:03:36 INFO  traceId=3cd5ded71f5024a926d7b084400c8967, parentId=, spanId=616220dab51c5f28, sampled=false [gr.vo.dx.su.se.SRVQuery] (executor-thread-1) Providing SRV attributes logs....
07:03:36 INFO  traceId=3cd5ded71f5024a926d7b084400c8967, parentId=, spanId=616220dab51c5f28, sampled=false [gr.vo.dx.su.se.SRVQuery] (executor-thread-1) 0 0 27017 dxl-gr-preprod-shard-00-01-nuler.mongodb.net.
07:03:36 INFO  traceId=3cd5ded71f5024a926d7b084400c8967, parentId=, spanId=616220dab51c5f28, sampled=false [gr.vo.dx.su.se.SRVQuery] (executor-thread-1) 0 0 27017 dxl-gr-preprod-shard-00-02-nuler.mongodb.net.
07:03:36 INFO  traceId=3cd5ded71f5024a926d7b084400c8967, parentId=, spanId=616220dab51c5f28, sampled=false [gr.vo.dx.su.se.SRVQuery] (executor-thread-1) 0 0 27017 dxl-gr-preprod-shard-00-00-nuler.mongodb.net.

07:24:08 INFO  traceId=3cd5ded71f5024a926d7b084400c8967, parentId=, spanId=616220dab51c5f28, sampled=false [or.mo.dr.cluster] (executor-thread-3) No server chosen by com.mongodb.reactivestreams.client.internal.ClientSessionHelper$$Lambda$771/0x0000000064382208@53b297d6 from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=dxl-gr-preprod-shard-00-02-nuler.mongodb.net:27017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=dxl-gr-preprod-shard-00-01-nuler.mongodb.net:27017, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=dxl-gr-preprod-shard-00-00.nuler.mongodb.net:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out_

Expected behavior

Quarkus should connect to Mongo Atlas without this SRV lookup or MongoSocketOpenException error

Actual behavior

Connection failure takes place when reaching the established time thresshold "Exception opening socket}, caused by {java.net.ConnectException: Connection refused}}]}. Waiting for 30000 ms before timing out"

How to Reproduce?

Using the free tier mongo atlas account,

Output of uname -a or ver

MINGW64_NT-10.0-22621 E0624331 3.3.3-341.x86_64 2022-01-17 11:45 UTC x86_64 Msys

Output of java -version

java version "17.0.7" 2023-04-18 LTS

Quarkus version or git rev

2.15.0 and 3.1.0

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

Dedicated script getMongoDebugLogs() :

public class SRVQuery {

    private static final Logger log = LoggerFactory.getLogger(SRVQuery.class);
    public static void getMongoDebugLogs() {
        try {
            // Set up the environment for creating the initial context
            Hashtable<String, String> env = new Hashtable<>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
            env.put(Context.PROVIDER_URL, "dns:");
            DirContext ctx = new InitialDirContext(env);
            // Specify the name of the domain to query
            String domainName = "_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net";
            // Perform the SRV lookup
            Attributes attrs = ctx.getAttributes(domainName, new String[]{"SRV"});
            Attribute attr = attrs.get("SRV");
            log.info("Providing SRV attributes logs....");
            if (attr != null) {
                for (int i = 0; i < attr.size(); i++) {
                    log.info((String) attr.get(i));
                }
            } else {
                log.info("No SRV records found for " + domainName);
            }
            ctx.close();
        } catch (NamingException e) {
            log.info("Handing mongo debugging exception..");
            e.printStackTrace();
            log.info(e.getMessage());
        }
    }

}
quarkus-bot[bot] commented 11 months ago

/cc @evanchooly (mongodb), @loicmathieu (mongodb)

loicmathieu commented 11 months ago

Hi, Can you try with the latest Quarkus version, it should be 3.6.1. If it still fails, can you provide a reproducer project?

charalampos-micholias commented 11 months ago

Hello, good afternoon, After several further tests we have identified the exact pattern that this issue occurs. More specific issue occurs on the below combination:

keep as important note -as a tested workaround- that issue does not appear upon combining Quarkus 3.1.0 and connection string containing exact mongo shards formatted as: mongodb://USERNAME:PASSWORD@XXXXX-shard-00-00.nuler.mongodb.net:27017,XXXXX-shard-00-01-nuler.mongodb.net:27017/

So after many thorough test seems that even Quarkus 3.1.0 version faces inability to establish connection when a SRV mongo connection string is provided.

Exact logs for aforementioned combination:

15:09:53 INFO  traceId=, parentId=, spanId=, sampled= [or.mo.dr.cluster] (cluster-6578779b5cf3934a85277080) No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out

15:09:56 INFO  traceId=0caf882656bffbdbd9cae0ca6fbfeb3f, parentId=, spanId=c9077c52e865326b, sampled=false [or.mo.dr.cluster] (executor-thread-2) No server chosen by com.mongodb.reactivestreams.client.internal.ClientSessionHelper$$Lambda$790/0x00000000903f70f8@4180c2b from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out

15:09:57 INFO  traceId=, parentId=, spanId=, sampled= [or.mo.dr.cluster] (cluster-6578779b5cf3934a85277080) No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out_

_15:10:23 ERROR traceId=ad5511d672b3e0370c637d1fb918f448, parentId=, spanId=12c2d61cc5444467, sampled=true [io.qu.mu.ru.MutinyInfrastructure] (vert.x-eventloop-thread-0) Mutiny had to drop the following exception: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.dxl-gr-preprod-nuler.mongodb.net'., servers=[]
    at com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:428)
    at com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:324)
    at com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)
    at com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:503)
    at java.base/java.lang.Thread.run(Unknown Source)

As upgrading further in 3.6.1 will require extended effort in a series of services, would you please update us if there are implemented enhancements in mongo connection establishment area in Quarkus 3.6.1 that could resolve the above issue and which are they in high level?

Thank you all for your assistance

va1erian commented 11 months ago

Hello, we are experiencing similar issues with Quarkus 3.5.3, we have these stacktraces showing every once in a while when connected to a MongoDB Atlas cluster: Unable to look up SRV record for host _mongodb._tcp.option-pl-0.91p5n.mongodb.net we are using the reactive client. Sorry I don't have th full stacktrace, can't find a way to export it cleanly for datadog.

roiar commented 10 months ago

hello all! @loicmathieu what kind of corrections are available on 3.6.1?

charalampos-micholias commented 10 months ago

Hello @loicmathieu, we would like to ask if there is an ongoing investigation on the above issue. Thank you,

danielspeixoto commented 5 months ago

Facing the same problem here

loicmathieu commented 5 months ago

@va1erian @roiar @charalampos-micholias in JVM mode we use the default MongoDB DNS resolver, nothing specific to Quarkus. However, in native mode there is an experimental DNS resolver as the MongoDB DNS resolver uses JNI which is not supported in native, this is explained here: https://quarkus.io/guides/mongodb#using-mongosrv-urls

Did you deploy to native and use the experimental DNS resolver?

va1erian commented 5 months ago

@loicmathieu we deploy a regular JVM without touching the dns resolver configuration. However note that this issue disappeared at one point when we upgraded Quarkus, I can't recall the exact version but it has been a couple of month already. We are currently running Quarkus 3.10

Tomatenjoghurt commented 5 months ago

We're facing exactly the same issue running Quarkus 3.10. Don't know if this is relevant, but we're using JVM Images for Java 17, so, if I read the documentation correctly, it should be this base image: registry.access.redhat.com/ubi8/openjdk-17-runtime:1.19

Relevant Log Messages:

Exception while resolving SRV records
No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.some.name.mongodb.net'., connectionMode=MULTIPLE, serverDescriptions=[]}. Waiting for 30000 ms before timing out

If we remove the srv Record from the config and add all the single database entries, it is working.

geoand commented 2 months ago

See https://github.com/quarkusio/quarkus/discussions/31971#discussioncomment-10613900 for a possible workaround

geoand commented 2 months ago

We believe that #43222 fixes this, so I will close it for now, but if the issue persists with 3.15 (which will contain the fix), we can reopen

Gabriel-Longo-at-DB1 commented 2 months ago

Hello @geoand , we are still facing the issue with Quarkus 3.14.4 (native), the only workaround we were able to apply with major effects, was ditching mongo+srv:// in favor of simple mongo:// connection string. (ref. we used: https://stackoverflow.com/questions/78713411/com-mongodb-mongoconfigurationexception-failed-looking-up-txt-record-for-host)

We inspected the solution in #43222, and concluded that to our case the retry is not effective, since that once the error is emitted by the Mongo connection attempt, nothing will work until the pod / application is restarted and some period (1 - 10min) passes.

We contacted Mongo Atlas support to little help, except an upsell that we bought, and the problem persisted.

Additional info: We have at least 4 Quarkus microservices running 3.8.6 (LTS) (Native) and 3.14 that are not affected by the issue.

geoand commented 2 months ago

We inspected the solution in https://github.com/quarkusio/quarkus/pull/43222, and concluded that to our case the retry is not effective, since that once the error is emitted by the Mongo connection attempt, nothing will work until the pod / application is restarted and some period (1 - 10min) passes.

Can you elaborate on this a little more please?

Gabriel-Longo-at-DB1 commented 2 months ago

We inspected the solution in #43222, and concluded that to our case the retry is not effective, since that once the error is emitted by the Mongo connection attempt, nothing will work until the pod / application is restarted and some period (1 - 10min) passes.

Can you elaborate on this a little more please?

Thanks for the reply, I'll try to elaborate a bit more:

Application will emit the following exception for a couple o seconds, then health-check will fail and app restarted, then we will randomly face the same problem for a few minutes, until connection is successful, then the cycle repeats, so in this case it seems the retry (part of the solution of #43222 ) doesn't work because the srv resolution will persistently fail for minutes at times.

Mutiny had to drop the following exception [Error Occurred After Shutdown]: com.mongodb.MongoTimeoutException: Timed out while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.<our-mongo-sub-domain>.mongodb.net'., servers=[]
    at com.mongodb.internal.connection.BaseCluster.createAndLogTimeoutException(BaseCluster.java:392)
    at com.mongodb.internal.connection.BaseCluster.selectServer(BaseCluster.java:148)
    at com.mongodb.internal.connection.AbstractMultiServerCluster.selectServer(AbstractMultiServerCluster.java:54)
    at com.mongodb.internal.binding.ClusterBinding.getReadConnectionSource(ClusterBinding.java:108)
    at com.mongodb.client.internal.ClientSessionBinding.getConnectionSource(ClientSessionBinding.java:128)
    at com.mongodb.client.internal.ClientSessionBinding.getReadConnectionSource(ClientSessionBinding.java:92)
    at com.mongodb.internal.operation.SyncOperationHelper.withSuppliedResource(SyncOperationHelper.java:141)
    at com.mongodb.internal.operation.SyncOperationHelper.withSourceAndConnection(SyncOperationHelper.java:122)
    at com.mongodb.internal.operation.SyncOperationHelper.lambda$executeRetryableRead$4(SyncOperationHelper.java:186)
    at com.mongodb.internal.operation.SyncOperationHelper.lambda$decorateReadWithRetries$12(SyncOperationHelper.java:289)
    at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:67)
    at com.mongodb.internal.operation.SyncOperationHelper.executeRetryableRead(SyncOperationHelper.java:191)
    at com.mongodb.internal.operation.SyncOperationHelper.executeRetryableRead(SyncOperationHelper.java:173)
    at com.mongodb.internal.operation.CommandReadOperation.execute(CommandReadOperation.java:48)
    at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:153)
    at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:196)
    at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:165)
    at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:160)
    at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:150)
    at io.quarkus.mongodb.health.MongoHealthCheck$MongoClientCheck$1.get(MongoHealthCheck.java:185)
    at io.quarkus.mongodb.health.MongoHealthCheck$MongoClientCheck$1.get(MongoHealthCheck.java:182)
    at io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)
    at io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:28)
    at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
    at io.smallrye.mutiny.operators.uni.UniRunSubscribeOn.lambda$subscribe$0(UniRunSubscribeOn.java:27)
    at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:635)
    at org.jboss.threads.EnhancedQueueExecutor$Task.doRunWith(EnhancedQueueExecutor.java:2516)
    at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2495)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1495)
    at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:11)
    at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:11)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base@17.0.12/java.lang.Thread.run(Thread.java:840)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:885)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:860)

this shorter stack trace is also logged:

Mutiny had to drop the following exception: com.mongodb.MongoTimeoutException: Timed out while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.<our-mongo-sub-domain>.mongodb.net'., servers=[]
    at com.mongodb.internal.connection.BaseCluster.createAndLogTimeoutException(BaseCluster.java:392)
    at com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:293)
    at com.mongodb.internal.connection.BaseCluster.access$800(BaseCluster.java:85)
    at com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:475)
    at java.base@17.0.12/java.lang.Thread.run(Thread.java:840)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:885)
    at org.graalvm.nativeimage.builder/com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:860)

We are also trying find if there are any problems with DNS resolution in our infrastructure, but so far, nothing.

Edit: seems to be related to https://github.com/quarkusio/quarkus/issues/32313

geoand commented 2 months ago

I wonder if quarkus.mongodb.dns.lookup-timeout will help in your case.

Gabriel-Longo-at-DB1 commented 2 months ago

I wonder if quarkus.mongodb.dns.lookup-timeout will help in your case.

Great, we will try-it out and report the results, thank you very much!

geoand commented 2 months ago

🙏🏽