microsoft / spring-data-cosmosdb

Access data with Azure Cosmos DB

Cosmos DB threads cannot be killed #552

Closed loudywen closed 4 years ago

loudywen commented 4 years ago

Hi, I would like to know whether there is any way to kill or stop the Cosmos DB client programmatically. I am trying to use @RefreshScope to refresh the Cosmos DB connection. What I am seeing is that after the refresh scope is triggered, Spring is able to refresh the Cosmos DB connection (new parallel-* threads are created) and inject it into other components. However, the old Cosmos DB threads keep running in the background and keep trying to connect to the old URL.

So what I am trying to do is have a listener for the refresh event; once it fires, I want to close all the Cosmos DB connections. But I have no idea what I should call to stop the Cosmos DB instance before it is re-initialized by Spring.
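For context, the setup being described looks roughly like the sketch below. This is a minimal sketch only: the property keys and the builder calls are assumptions based on the com.azure.data.cosmos 3.x SDK, not code taken from the linked sample.

    import com.azure.data.cosmos.CosmosClient;
    import org.springframework.cloud.context.config.annotation.RefreshScope;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.env.Environment;

    @Configuration
    public class CosmosRefreshConfig {

        // Refresh-scoped: Spring destroys and rebuilds this bean when the
        // configuration is refreshed; destroyMethod = "close" asks the container
        // to call close() on the old instance during that destruction.
        @Bean(destroyMethod = "close")
        @RefreshScope
        public CosmosClient cosmosClient(Environment env) {
            // Property keys are assumptions; the values are re-read from the
            // refreshed Environment each time the bean is rebuilt.
            return CosmosClient.builder()
                    .endpoint(env.getProperty("cosmosdb.uri"))
                    .key(env.getProperty("cosmosdb.key"))
                    .build();
        }
    }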

loudywen commented 4 years ago

Scenario (CCM is fully set up; this sample is used for testing: https://github.com/sophiaso/spring-cosmosdb-refresh/blob/master/src/main/java/com/example/cosmosdbrefresh/CosmosConfiguration.java). Steps:

  1. Update the URL in CCM to some wrong value (threads before the refresh).

  2. Hit the application REST endpoint to trigger the refresh.

  3. The application picks up the new URL, refreshes the connection, and raises an exception after a few seconds:

    Caused by: reactor.core.Exceptions$ReactiveException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: sql-dev-azure-cosmosdb.documents.azure.com/104.215.113.77:4435
    at reactor.core.Exceptions.propagate(Exceptions.java:336)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:91)
    at reactor.core.publisher.Mono.block(Mono.java:1663)
    at com.azure.data.cosmos.internal.RxDocumentClientImpl.initializeGatewayConfigurationReader(RxDocumentClientImpl.java:246)
    at com.azure.data.cosmos.internal.RxDocumentClientImpl.init(RxDocumentClientImpl.java:265)
    at com.azure.data.cosmos.internal.AsyncDocumentClient$Builder.build(AsyncDocumentClient.java:185)
    at com.azure.data.cosmos.CosmosClient.<init>(CosmosClient.java:54)
    at com.azure.data.cosmos.CosmosClientBuilder.build(CosmosClientBuilder.java:203)
    at com.microsoft.azure.spring.data.cosmosdb.CosmosDbFactory.getCosmosClient(CosmosDbFactory.java:59)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:282)
    at org.springframework.cloud.context.scope.GenericScope$LockedScopedProxyFactoryBean.invoke(GenericScope.java:494)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
    at com.microsoft.azure.spring.data.cosmosdb.CosmosDbFactory$$EnhancerBySpringCGLIB$$f659ab01.getCosmosClient(<generated>)
    at com.microsoft.azure.spring.data.cosmosdb.core.CosmosTemplate.<init>(CosmosTemplate.java:78)
    at com.walmart.intech.armada.api.DatabaseConfiguration.documentDbTemplate(DatabaseConfiguration.java:52)
    at com.walmart.intech.armada.api.DatabaseConfiguration$$EnhancerBySpringCGLIB$$cb8ee2a6.CGLIB$documentDbTemplate$1(<generated>)
    at com.walmart.intech.armada.api.DatabaseConfiguration$$EnhancerBySpringCGLIB$$cb8ee2a6$$FastClassBySpringCGLIB$$e37f77bc.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:244)
    at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:363)
    at com.walmart.intech.armada.api.DatabaseConfiguration$$EnhancerBySpringCGLIB$$cb8ee2a6.documentDbTemplate(<generated>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:154)
    ... 113 common frames omitted
    Suppressed: java.lang.Exception: #block terminated with an error
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:93)
        ... 141 common frames omitted
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: sql-dev-azure-cosmosdb.documents.azure.com/104.215.113.77:4435
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
    Assembly trace from producer [reactor.core.publisher.MonoCreate] :
    reactor.core.publisher.Mono.create(Mono.java:185)
    reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
    Error has been observed at the following site(s):
    |_        Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
    |_        Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider.acquire(PooledConnectionProvider.java:135)
    |_        Mono.create ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:284)
    |_         Mono.retry ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:321)
    |_   Mono.flatMapMany ⇢ at reactor.netty.http.client.HttpClientFinalizer.responseConnection(HttpClientFinalizer.java:85)
    |_        Flux.single ⇢ at com.azure.data.cosmos.internal.http.ReactorNettyClient.send(ReactorNettyClient.java:106)
    |_       Mono.flatMap ⇢ at com.azure.data.cosmos.internal.directconnectivity.HttpClientUtils.parseResponseAsync(HttpClientUtils.java:19)
    |_           Mono.map ⇢ at com.azure.data.cosmos.internal.directconnectivity.GatewayServiceConfigurationReader.toDatabaseAccountObservable(GatewayServiceConfigurationReader.java:157)
    |_         Mono.error ⇢ at com.azure.data.cosmos.internal.GlobalEndpointManager.lambda$getDatabaseAccountFromAnyLocationsAsync$2(GlobalEndpointManager.java:97)
    |_ Mono.onErrorResume ⇢ at com.azure.data.cosmos.internal.GlobalEndpointManager.getDatabaseAccountFromAnyLocationsAsync(GlobalEndpointManager.java:93)
    |_   Mono.doOnSuccess ⇢ at com.azure.data.cosmos.internal.directconnectivity.GatewayServiceConfigurationReader.initializeReaderAsync(GatewayServiceConfigurationReader.java:142)
    Stack trace:
    Caused by: java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
  4. At this point, new threads are created (threads after the refresh).

    Now you can see that parallel-6/7/8 and cosmos-rntbd-nio-7-* have been created, while the old parallel-1/2/3/4/5 and cosmos-rntbd-nio-2-* still exist. The old parallel-1/2/3/4/5 threads are still holding the wrong URL for their connection and keep trying to connect to it periodically in the background, but the controller uses the correct URL without issue.

So, what is the way to kill those old threads before refresh?

kushagraThapar commented 4 years ago

Try closing the client properly using the close() API on CosmosClient and re-opening it with the new connection details when the refresh scope is triggered.
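A rough sketch of that suggestion, assuming a listener on the Spring Cloud refresh event and a hand-built client; the property keys and builder calls are illustrative, not the project's actual code:

    import com.azure.data.cosmos.CosmosClient;
    import org.springframework.cloud.context.scope.refresh.RefreshScopeRefreshedEvent;
    import org.springframework.context.event.EventListener;
    import org.springframework.core.env.Environment;
    import org.springframework.stereotype.Component;

    @Component
    public class CosmosClientRecycler {

        private final Environment env;
        private volatile CosmosClient client;

        public CosmosClientRecycler(Environment env) {
            this.env = env;
        }

        @EventListener(RefreshScopeRefreshedEvent.class)
        public synchronized void onRefresh(RefreshScopeRefreshedEvent event) {
            if (client != null) {
                // Release the old client's connections and background threads first.
                client.close();
            }
            // Re-open with the refreshed endpoint and key.
            client = CosmosClient.builder()
                    .endpoint(env.getProperty("cosmosdb.uri"))
                    .key(env.getProperty("cosmosdb.key"))
                    .build();
        }
    }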

loudywen commented 4 years ago

I did try it, but it looks like nothing happens. Closing the CosmosClient is not killing the AsyncDocumentClient.

loudywen commented 4 years ago
    @EventListener(RefreshScopeRefreshedEvent.class)
    public synchronized void onRefresh(RefreshScopeRefreshedEvent event) {
        CosmosClient cc = (CosmosClient) this.applicationContext.getBean("cosmosClient");
        cc.close();
    }

I have this listener for closing the client, but the old parallel-* threads still stay around, and new ones just keep getting created for the new connection.

loudywen commented 4 years ago

@kushagraThapar can you please give any suggestions?

loudywen commented 4 years ago

Is this a bug?

kushagraThapar commented 4 years ago

@loudywen - I don't think that should be a bug. However, there is a way to find out. Can you please enable debug logs for the Cosmos SDK and check for the log below when you close the client?

GlobalEndpointManager closed.
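For anyone following along, a minimal way to enable those debug logs is a single line in a Spring Boot application.properties; the package name is taken from the log lines in this thread, and the exact logger categories may vary:

    logging.level.com.azure.data.cosmos=DEBUG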
loudywen commented 4 years ago

@kushagraThapar I enabled the debug logs, but I don't see GlobalEndpointManager closed.

loudywen commented 4 years ago
2020-08-11 19:31:57,620 INFO  [http-nio-8090-exec-1] c.a.d.c.i.RxDocumentClientImpl  <init> -  Initializing DocumentClient with serviceEndpoint [https://xxxx:4435/], connectionPolicy [ConnectionPolicy{requestTimeoutInMillis=2000, mediaRequestTimeoutInMillis=300000, connectionMode=DIRECT, maxPoolSize=1000, idleConnectionTimeoutInMillis=5000, userAgentSuffix=';spring-data/2.2.4;', retryOptions=RetryOptions{maxRetryAttemptsOnThrottledRequests=0, maxRetryWaitTimeInSeconds=3}, enableEndpointDiscovery=true, preferredLocations=null, usingMultipleWriteLocations=false, inetSocketProxyAddress=null}], consistencyLevel [Session], directModeProtocol [Tcp]
2020-08-11 19:31:57,622 DEBUG [http-nio-8090-exec-1] c.a.d.c.i.GlobalEndpointManager  startRefreshLocationTimerAsync -  registering a refresh in [300000] ms
2020-08-11 19:31:57,636 DEBUG [parallel-6] c.a.d.c.i.GlobalEndpointManager  lambda$startRefreshLocationTimerAsync$11 -  startRefreshLocationTimerAsync() - Invoking refresh, I was registered on [2020-08-11T14:31:57.623]
2020-08-11 19:31:57,637 INFO  [parallel-6] c.a.d.c.i.RxDocumentClientImpl  getDatabaseAccountFromEndpoint -  Getting database account endpoint from https://sql-dev-azure-cosmosdb.documents.azure.com:4435/
2020-08-11 19:32:18,749 ERROR [reactor-http-nio-8] c.a.d.c.internal.RxGatewayStoreModel  lambda$toDocumentServiceResponse$2 -  Network failure
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: xxx.com/xxx.215.113.xx:4435
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Assembly trace from producer [reactor.core.publisher.MonoCreate] :
    reactor.core.publisher.Mono.create(Mono.java:185)
    reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
Error has been observed at the following site(s):
    |_      Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
    |_      Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider.acquire(PooledConnectionProvider.java:135)
    |_      Mono.create ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:284)
    |_       Mono.retry ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:321)
    |_ Mono.flatMapMany ⇢ at reactor.netty.http.client.HttpClientFinalizer.responseConnection(HttpClientFinalizer.java:85)
    |_      Flux.single ⇢ at com.azure.data.cosmos.internal.http.ReactorNettyClient.send(ReactorNettyClient.java:106)
    |_     Mono.flatMap ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:246)
    |_         Mono.map ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:283)
Stack trace:
Caused by: java.net.ConnectException: Connection timed out: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:336)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
2020-08-11 19:32:18,755 WARN  [reactor-http-nio-8] c.a.d.c.i.WebExceptionRetryPolicy  shouldRetry -  Received retriable web exception, will retry
com.azure.data.cosmos.CosmosClientException: null
    at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:346)
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Assembly trace from producer [reactor.core.publisher.MonoError] :
    reactor.core.publisher.Mono.error(Mono.java:281)
    com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$2(RxGatewayStoreModel.java:298)
Error has been observed at the following site(s):
    |_         Mono.error ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$2(RxGatewayStoreModel.java:298)
    |_ Mono.onErrorResume ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:284)
    |_        Flux.single ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$invokeAsync$3(RxGatewayStoreModel.java:365)
    |_         Mono.defer ⇢ at com.azure.data.cosmos.internal.BackoffRetryUtility.executeRetry(BackoffRetryUtility.java:31)
Stack trace:
        at com.azure.data.cosmos.BridgeInternal.createCosmosClientException(BridgeInternal.java:346)
        at com.azure.data.cosmos.internal.RxGatewayStoreModel.lambda$toDocumentServiceResponse$2(RxGatewayStoreModel.java:296)
        at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:88)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onError(FluxMapFuseable.java:134)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onError(MonoFlatMap.java:165)
        at reactor.core.publisher.MonoSingle$SingleSubscriber.onError(MonoSingle.java:141)
        at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:197)
        at reactor.core.publisher.FluxRetryPredicate$RetryPredicateSubscriber.onError(FluxRetryPredicate.java:101)
        at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:183)
        at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect$TcpClientSubscriber.onError(HttpClientConnect.java:345)
        at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:183)
        at reactor.netty.resources.PooledConnectionProvider$DisposableAcquire.onError(PooledConnectionProvider.java:492)
        at reactor.netty.internal.shaded.reactor.pool.AbstractPool$Borrower.fail(AbstractPool.java:378)
        at reactor.netty.internal.shaded.reactor.pool.SimplePool.lambda$drainLoop$7(SimplePool.java:203)
        at reactor.core.publisher.LambdaMonoSubscriber.doError(LambdaMonoSubscriber.java:152)
        at reactor.core.publisher.LambdaMonoSubscriber.onError(LambdaMonoSubscriber.java:147)
        at reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:183)
        at reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator$PooledConnectionInitializer.operationComplete(PooledConnectionProvider.java:303)
        at reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator$PooledConnectionInitializer.operationComplete(PooledConnectionProvider.java:254)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:500)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:493)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:472)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:413)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:538)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:531)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:111)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:323)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:685)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:632)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:549)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: xxx:4435
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Assembly trace from producer [reactor.core.publisher.MonoCreate] :
    reactor.core.publisher.Mono.create(Mono.java:185)
    reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
Error has been observed at the following site(s):
    |_      Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider$PooledConnectionAllocator.connectChannel(PooledConnectionProvider.java:241)
    |_      Mono.create ⇢ at reactor.netty.resources.PooledConnectionProvider.acquire(PooledConnectionProvider.java:135)
    |_      Mono.create ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:284)
    |_       Mono.retry ⇢ at reactor.netty.http.client.HttpClientConnect$MonoHttpConnect.subscribe(HttpClientConnect.java:321)
    |_ Mono.flatMapMany ⇢ at reactor.netty.http.client.HttpClientFinalizer.responseConnection(HttpClientFinalizer.java:85)
    |_      Flux.single ⇢ at com.azure.data.cosmos.internal.http.ReactorNettyClient.send(ReactorNettyClient.java:106)
    |_     Mono.flatMap ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:246)
    |_         Mono.map ⇢ at com.azure.data.cosmos.internal.RxGatewayStoreModel.toDocumentServiceResponse(RxGatewayStoreModel.java:283)

@kushagraThapar, from the log you can see there is no close happening.

loudywen commented 4 years ago
2020-08-11 19:34:32,360 DEBUG [reactor-http-nio-3] c.a.d.c.i.GlobalEndpointManager  lambda$startRefreshLocationTimerAsync$10 -  db account retrieved
2020-08-11 19:34:32,360 DEBUG [reactor-http-nio-3] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationPrivateAsync$9 -  refreshLocationPrivateAsync() refreshing locations
2020-08-11 19:34:32,518 DEBUG [reactor-http-nio-3] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationPrivateAsync$9 -  shouldRefreshEndpoints: false, nothing to do.
2020-08-11 19:34:32,669 DEBUG [http-nio-8090-exec-2] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationAsync$6 -  refreshLocationAsync() invoked
2020-08-11 19:34:32,670 DEBUG [http-nio-8090-exec-2] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationAsync$6 -  will refresh
2020-08-11 19:34:32,671 DEBUG [http-nio-8090-exec-2] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationPrivateAsync$9 -  refreshLocationPrivateAsync() refreshing locations
2020-08-11 19:34:32,672 DEBUG [http-nio-8090-exec-2] c.a.d.c.i.GlobalEndpointManager  lambda$refreshLocationPrivateAsync$9 -  shouldRefreshEndpoints: false, nothing to do.

Here is the log when I gave the correct URL. So during the refreshScope trigger, no close happened! @kushagraThapar

loudywen commented 4 years ago
2020-08-11 19:43:22,035 DEBUG [parallel-3] c.a.d.c.i.GlobalEndpointManager  lambda$startRefreshLocationTimerAsync$11 -  startRefreshLocationTimerAsync() - Invoking refresh, I was registered on [2020-08-11T14:38:22.027]
2020-08-11 19:43:22,035 INFO  [parallel-3] c.a.d.c.i.RxDocumentClientImpl  getDatabaseAccountFromEndpoint -  Getting database account endpoint from https://xxxx:4435/

@kushagraThapar see this: after some period of time, some of the parallel-x threads pick up the previous wrong value and try to connect...

I mean, I guess something triggers the RxDocumentClientImpl to initialize the connection with the wrong value, but I have no idea what is triggering it. This issue could be solved if I were able to close the previous Cosmos threads. Please help.

kushagraThapar commented 4 years ago

@loudywen - thanks for attaching the logs; however, I am not seeing any logs emitted for closing the Cosmos client at all. That indicates that your onRefresh method is not getting called. If it had been called, you would see logs something like this:

10:45:56.700 [reactor-http-nio-3] INFO  c.a.c.i.RxDocumentClientImpl - Shutting down ...
10:45:56.700 [reactor-http-nio-3] INFO  c.a.c.i.RxDocumentClientImpl - Closing Global Endpoint Manager ...
10:45:56.701 [reactor-http-nio-3] INFO  c.a.c.i.RxDocumentClientImpl - Closing StoreClientFactory ...
10:45:56.703 [reactor-http-nio-3] INFO  c.a.c.i.RxDocumentClientImpl - Shutting down reactorHttpClient ...
10:45:56.710 [reactor-http-nio-3] INFO  c.a.c.i.RxDocumentClientImpl - Shutting down completed.
loudywen commented 4 years ago

@kushagraThapar, I am not explicitly calling the refresh endpoint.

Like I said, I just have a POJO annotated with @ConfigurationProperties and the configuration class (https://github.com/sophiaso/spring-cosmosdb-refresh/blob/master/src/main/java/com/example/cosmosdbrefresh/CosmosConfiguration.java).

And I use @RefreshScope.

Once the CCM is updated, I just let Spring handle the rest. I don't hit any refresh endpoint at all.

Can you provide me with an example of how to trigger the refresh programmatically?
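For reference, one common way to trigger the Spring Cloud refresh programmatically is the ContextRefresher bean; this is a sketch, not something provided in this thread:

    import java.util.Set;

    import org.springframework.cloud.context.refresh.ContextRefresher;
    import org.springframework.stereotype.Component;

    @Component
    public class ManualRefresher {

        private final ContextRefresher contextRefresher;

        public ManualRefresher(ContextRefresher contextRefresher) {
            this.contextRefresher = contextRefresher;
        }

        public Set<String> refreshNow() {
            // Re-reads the environment and destroys/rebuilds @RefreshScope beans;
            // returns the property keys whose values changed.
            return contextRefresher.refresh();
        }
    }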

loudywen commented 4 years ago
2020-08-11 20:36:52,056 INFO  [scm-timer-thread-1] c.a.d.c.i.RxDocumentClientImpl  close -  Shutting down ...
2020-08-11 20:36:52,056 DEBUG [scm-timer-thread-1] c.a.d.c.i.GlobalEndpointManager  close -  GlobalEndpointManager closed.

I call this within the EnvironmentChangeEvent handler, and I am able to see only the two log lines above. @kushagraThapar

    CosmosClient cosmosClient = (CosmosClient) this.applicationContext.getBean("cosmosClient");
    cosmosClient.close();

startRefreshLocationTimerAsync keeps triggering with the previous value:

2020-08-11 21:09:00,257 DEBUG [parallel-8] c.a.d.c.i.GlobalEndpointManager  lambda$startRefreshLocationTimerAsync$11 -  startRefreshLocationTimerAsync() - Invoking refresh, I was registered on [2020-08-11T16:04:00.254]
loudywen commented 4 years ago

@kushagraThapar can you try from your end and see whether the connection (threads) can be closed/killed? I still think there is something wrong here.

kushagraThapar commented 4 years ago

@loudywen - sure, I will try it out and will let you know if I see these threads hanging around.

kushagraThapar commented 4 years ago

@loudywen - meanwhile, can you please provide the full set of logs?

loudywen commented 4 years ago

@kushagraThapar here you go: spring-boot-logger.log

Again, the scenario I am testing is:

  1. CCM and the application are started.
  2. Go to CCM and set some wrong URL value.
  3. The application picks up the updated value and recreates the connection with the wrong URL.
  4. A timeout exception is raised since the URL is wrong.
  5. Go to CCM and set the correct URL value.
  6. The application picks up the updated value and recreates the connection with the correct URL.
  7. Data can be received through the new connection from the front end.
  8. However, some old Cosmos threads (parallel-x) try to reconnect with the previous wrong URL at unpredictable times and raise exceptions in the background.

I just don't understand why those previous values still exist and get picked up by some of those parallel-x threads. My assumption is that if Cosmos were able to shut down the old thread pool correctly and recreate a new one during the refresh, this issue would be solved.

kushagraThapar commented 4 years ago

@loudywen - thanks for providing the detailed logs. I looked at them, and I can't find where the client actually got closed and re-instantiated.

Since Spring initially creates a bean for CosmosClient, I think that when you update the URL through your application, the existing CosmosClient is not getting closed. I am still not sure how you have hooked up the refresh in your application - but what I am trying to say is that you will have to close the client manually and recreate it from scratch with the new URL and key.

That being said, you will have to find out how Spring is managing the CosmosClient bean and how you can re-instantiate that bean.
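One way to do that with Spring Cloud, sketched here under the assumption that the client is registered as a refresh-scoped bean named "cosmosClient": the RefreshScope bean can destroy a single scoped instance (running its destroy callback, e.g. close()) so the next lookup builds a fresh one.

    import org.springframework.cloud.context.scope.refresh.RefreshScope;
    import org.springframework.stereotype.Component;

    @Component
    public class CosmosBeanRecycler {

        private final RefreshScope refreshScope;

        public CosmosBeanRecycler(RefreshScope refreshScope) {
            this.refreshScope = refreshScope;
        }

        public void recycleCosmosClient() {
            // Destroys the current "cosmosClient" instance (invoking its destroy
            // callbacks) so that the next lookup creates a fresh client.
            refreshScope.refresh("cosmosClient");
        }
    }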

Since this repository is @sophiaso's, I guess they will have much more information on this.

@sophiaso - can you please help us here? The problem is that the old client is not getting closed - which leaves some old threads behind.

loudywen commented 4 years ago

@kushagraThapar @sophiaso yes, the close is not happening even when I trigger the refresh (manually). That is why I said something is wrong here. @sophiaso if you run your sample and go through the same scenario I did, I think you should see the same issue.

The code that @sophiaso shows in his repo should be the correct way to close the bean and re-instantiate it, and I don't see any issue with it.

Also, the other way I tried to close the CosmosClient was to call cosmosClient.close() directly after my application started; still, none of the thread pools were closed.

loudywen commented 4 years ago

@kushagraThapar Have you tried to test it on your side?

kushagraThapar commented 4 years ago

@loudywen - I tested this issue with the sample code from the spring-data-cosmosdb repo and I couldn't repro the issue at all. I see that closing the client actually closes the GlobalEndpointManager, and there are no related threads hanging around.

However, I suspect the issue with your application is that there are multiple clients, and not all of them are getting closed properly. I would suggest making sure all the clients are closed properly on your end. Please debug the application and verify that every client that gets initialized is also closed properly.

As in the error logs you shared above, there were no close calls happening. I think that is the major issue.

loudywen commented 4 years ago

@kushagraThapar thanks for the suggestion. I do understand what you suggested and I did follow the steps you asked for, but the issue still exists no matter what.

By the way, what version are you using for your test? I am using:

    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>spring-data-cosmosdb</artifactId>
        <version>2.2.4</version>
    </dependency>
kushagraThapar commented 4 years ago

@loudywen I tested this with spring-data-cosmosdb 2.3.0 and with the new version of azure-spring-data-cosmos that will be released today - 3.0.0-beta.1 - and I couldn't repro the issue.
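For anyone switching over, the new artifact's Maven coordinates should look roughly like this (the com.azure group id is an assumption; check the docs linked below for the authoritative snippet):

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-spring-data-cosmos</artifactId>
        <version>3.0.0-beta.1</version>
    </dependency>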

loudywen commented 4 years ago

@kushagraThapar so going forward, should I use spring-data-cosmosdb 2.3.0 or azure-spring-data-cosmos?

I can see that azure-spring-data-cosmos rewrote the AbstractCosmosConfiguration completely, and it looks more reasonable than the one in spring-data-cosmosdb.

kushagraThapar commented 4 years ago

@loudywen - the new azure-spring-data-cosmos will be released today. You should use it going forward, and let's also move this issue to the new version of the SDK, where it will be much easier to debug and fix.

For your reference, here are the docs for the new version - https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-spring-data-cosmos-core/README.md - and here is the changelog: https://github.com/Azure/azure-sdk-for-java/pull/14187

This will be released today and you should be able to use it.

kushagraThapar commented 4 years ago

@loudywen closing this issue here, as we are tracking it in the new repo.