orange-cloudfoundry / osb-cmdb

A configuration management db for Open Service Broker API broker implementations
Apache License 2.0
14 stars 1 forks source link

Improve diagnostics and contention over reactor assembly trace #82

Open gberche-orange opened 4 years ago

gberche-orange commented 4 years ago

Expected behavior

As an osb-cmdb operator

Observed behavior

reactor assembly trace thread contention

In the stack trace available at https://gist.github.com/gberche-orange/efcbedcd9f5c715dcd5e1505eb5503cd

46 thread out of 47 threads are waiting on the reactor assembly at https://github.com/reactor/reactor-core/blob/ef5d9c0d0f1b65b6669273a1ac651d2d6a019c15/reactor-core/src/main/java/reactor/core/publisher/FluxOnAssembly.java#L262 with stack trace:

 "http-nio-8080-exec-140" #171 daemon prio=5 os_prio=0 tid=0x000055a8f53aa800 nid=0x146 waiting for monitor entry [0x00007fadcac24000]
    java.lang.Thread.State: BLOCKED (on object monitor)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.add(FluxOnAssembly.java:262)
    - waiting to lock <0x00000000afa68fd8> (a java.util.LinkedList)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.add(FluxOnAssembly.java:254)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.fail(FluxOnAssembly.java:458)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblySubscriber.onError(FluxOnAssembly.java:390)
    at reactor.core.publisher.Operators$MonoSubscriber.onError(Operators.java:1759)
    at reactor.core.publisher.MonoCacheTime.subscribeOrReturn(MonoCacheTime.java:146)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4195)
    at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:199)
    at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53)

Surprisingly, no thread is observed to currently lock the LinkedList

cf-java-client high contention & completion times

Some cf-java-client calls take more than 60 mins to complete, see below queries before and after JVM restarts

grep -h 'FINISH' cftraces* | sort | cut -d ' ' -f 4,15- 
2020-11-17T15:34:18.87+0100          : FINISH Get Organization (onError/1.0 h)
2020-11-17T15:34:30.07+0100          : FINISH Get Organization (onError/47.4 m)
2020-11-17T15:54:28.25+0100          : FINISH Get Organization (onError/58.3 m)
2020-11-17T15:55:18.53+0100          : FINISH Get Organization (onError/1.6 h)
# JVM restart at this time: 2020-11-17T15:56:07.00+0100 
2020-11-17T15:56:55.72+0100          : FINISH List Service Definitions (onComplete/1.5 s)
2020-11-17T15:57:36.65+0100          : FINISH Get Organization (onComplete/351 ms)
2020-11-17T15:57:37.56+0100          : FINISH Get Service Instance (onComplete/844 ms)
2020-11-17T15:57:37.73+0100          : FINISH Delete Service Key (onError/158 ms)
2020-11-17T15:57:37.89+0100          : FINISH Get Service Key (onError/126 ms)
2020-11-17T15:59:00.67+0100          : FINISH Get Organization (onComplete/229 ms)
2020-11-17T15:59:00.95+0100          : FINISH Get Service Instance (onError/195 ms)
2020-11-17T15:59:11.78+0100          : FINISH Get Organization (onComplete/322 ms)
2020-11-17T15:59:12.02+0100          : FINISH Get Organization (onComplete/470 ms)
2020-11-17T15:59:12.35+0100          : FINISH Get Service Instance (onError/331 ms)
2020-11-17T15:59:12.40+0100          : FINISH Get Service Instance (onError/315 ms)
2020-11-17T15:59:54.29+0100          : FINISH Get Organization (onComplete/195 ms)
[...]

reactor stack trace keep trashing

Apparent signs of reactive loops when handling uaa error, see fragment below, and larger fragment in gist

https://gist.github.com/gberche-orange/3282249f8d5d08a1ab35b7bc5a5488a0

Need to further study https://projectreactor.io/docs/core/release/reference/#debug-activate

 2020-11-17 14:35:09.278  INFO 8 --- [nio-8080-exec-8] c.o.o.o.s.OsbCmdbServiceInstance         : Unable to provision service, caught:org.cloudfoundry.uaa.UaaException: null: null
 org.cloudfoundry.uaa.UaaException: null: null
    at org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$null$8(ErrorPayloadMappers.java:89) ~[cloudfoundry-client-reactor-4.8.0.RELEASE.jar:na]
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
 Assembly trace from producer [reactor.core.publisher.MonoFlatMap] :
    reactor.core.publisher.Mono.flatMap(Mono.java:2734)
    org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$mapToError$12(ErrorPayloadMappers.java:110)
 Error has been observed at the following site(s):
    |_                     Mono.flatMap ⇢ at org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$mapToError$12(ErrorPayloadMappers.java:110)
    |_                     Flux.flatMap ⇢ at org.cloudfoundry.reactor.util.ErrorPayloadMappers.lambda$uaa$9(ErrorPayloadMappers.java:84)
    |_                   Flux.transform ⇢ at org.cloudfoundry.reactor.util.Operator$ResponseReceiver.parseBodyToToken(Operator.java:200)
    |_                     Flux.flatMap ⇢ at org.cloudfoundry.reactor.util.Operator$ResponseReceiver.parseBodyToToken(Operator.java:202)
    |_                     Flux.flatMap ⇢ at org.cloudfoundry.reactor.util.Operator$ResponseReceiver.parseBodyToToken(Operator.java:211)
    |_               Flux.singleOrEmpty ⇢ at org.cloudfoundry.reactor.util.Operator$ResponseReceiver.parseBodyToToken(Operator.java:211)
    |_                     Mono.flatMap ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.requestToken(AbstractUaaTokenProvider.java:232)
    |_ FluxMap$MapSubscriber.onComplete ⇢ at reactor.netty.channel.FluxReceive.terminateReceiver(FluxReceive.java:421)
    |_               Mono.onErrorResume ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.refreshToken(AbstractUaaTokenProvider.java:217)
    |_               Mono.doOnSubscribe ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.lambda$token$13(AbstractUaaTokenProvider.java:248)
    |_                     Mono.flatMap ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.token(AbstractUaaTokenProvider.java:247)
    |_               Mono.switchIfEmpty ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.token(AbstractUaaTokenProvider.java:249)
    |_                       Mono.cache ⇢ at java.util.Optional.orElseGet(Optional.java:267)
    |_                  Mono.checkpoint ⇢ at org.cloudfoundry.reactor.tokenprovider.AbstractUaaTokenProvider.token(AbstractUaaTokenProvider.java:255)
    |_                         Mono.map ⇢ at org.cloudfoundry.reactor.util.AbstractReactorOperations.addHeadersWhen(AbstractReactorOperations.java:65)
    |_                         Mono.map ⇢ at reactor.netty.http.client.HttpClientHeadersWhen.lambda$apply$0(HttpClientHeadersWhen.java:54)
    |_                     Mono.flatMap ⇢ at reactor.netty.http.client.HttpClientConfiguration.lambda$deferredConf$9(HttpClientConfiguration.java:213)
    |_                   Mono.transform ⇢ at reactor.netty.http.client.HttpClientConnect$HttpTcpClient.connect(HttpClientConnect.java:165)
    |_                     Mono.flatMap ⇢ at reactor.netty.http.client.HttpClientConnect$HttpTcpClient.connect(HttpClientConnect.java:166)
    |_                 Mono.flatMapMany ⇢ at reactor.netty.http.client.HttpClientFinalizer.responseConnection(HttpClientFinalizer.java:85)
    |_                    Flux.doOnNext ⇢ at org.cloudfoundry.reactor.util.Operator$ResponseReceiver.invalidateToken(Operato

logs are lacking thread names

making it hard to distinguish concurrent threads dumping the same entries such as:

   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_               Mono.switchIfEmpty ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.createSpace(AbstractOsbCmdbService.java:112)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                       Mono.block ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceScopedOperations(AbstractOsbCmdbService.java:65)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                         Mono.map ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getOrganizationId(AbstractOsbCmdbService.java:130)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.lambda$getSpaceId$13(AbstractOsbCmdbService.java:136)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceId(AbstractOsbCmdbService.java:135)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_               Mono.switchIfEmpty ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.createSpace(AbstractOsbCmdbService.java:112)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                       Mono.block ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceScopedOperations(AbstractOsbCmdbService.java:65)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                         Mono.map ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getOrganizationId(AbstractOsbCmdbService.java:130)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.lambda$getSpaceId$13(AbstractOsbCmdbService.java:136)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceId(AbstractOsbCmdbService.java:135)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_               Mono.switchIfEmpty ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.createSpace(AbstractOsbCmdbService.java:112)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                       Mono.block ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceScopedOperations(AbstractOsbCmdbService.java:65)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                         Mono.map ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getOrganizationId(AbstractOsbCmdbService.java:130)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.lambda$getSpaceId$13(AbstractOsbCmdbService.java:136)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                     Mono.flatMap ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceId(AbstractOsbCmdbService.java:135)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_               Mono.switchIfEmpty ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.createSpace(AbstractOsbCmdbService.java:112)
   2020-11-17T15:35:51.86+0100 [APP/PROC/WEB/0] OUT     |_                       Mono.block ⇢ at com.orange.oss.osbcmdb.AbstractOsbCmdbService.getSpaceScopedOperations(AbstractOsbCmdbService.java:65)

Affected release

Reproduced on version 1.2.0

gberche-orange commented 3 years ago

Bump to https://github.com/cloudfoundry/cf-java-client/releases/tag/v4.16.0.RELEASE , see https://github.com/cloudfoundry/cf-java-client/pull/1118

Previously when a 401 would occur, we would throw the InvalidTokenException which in turn would trigger a retry. The retry would allow the TokenProvider to fetch a new, possibly valid token. When this happened, you could get into a state where reactor-netty was waiting for the library to finish using the connection (either consume the body or dispose it). Since we never did that, you could accumulate connections in the ESTABLISHED state. This code change will read and throw away the body when there's a 401, which signals to reactor-netty that it can reuse the connection.

No yet evidence that we received 401 though.