zipkin-server not work with kafka and elasticsearch (over capacity)

haixinke commented 3 years ago

Describe the Bug

We run zipkin-server 2.23.2 reading from Kafka through zipkin-collector-kafka, with Elasticsearch as the storage. Under a high volume of spans, zipkin-server stops working: the JVM old generation is 100% full and CPU usage is very high, after which the server stops responding and UI queries fail as well. The script I use to start the server is:

    java -Dzipkin.collector.kafka.bootstrap-servers=192.168.30.72:9092 \
      -Dzipkin.collector.kafka.topic=zipkin \
      -Dzipkin.collector.kafka.groupId=zipkin \
      -Dzipkin.collector.kafka.overrides.max.poll.interval.ms=300000 \
      -Dzipkin.collector.kafka.overrides.max.poll.records=500 \
      -Dzipkin.collector.kafka.overrides.auto.offset.reset=latest \
      -Dzipkin.collector.kafka.streams=16 \
      -Dzipkin.storage.type=elasticsearch \
      -Dzipkin.storage.elasticsearch.hosts=192.168.30.72:19200 \
      -Dzipkin.storage.elasticsearch.username=elastic \
      -Dzipkin.storage.elasticsearch.password=123456 \
      -jar zipkin-server-2.23.2.jar

In the logs we get the following exception:

    2021-05-26 08:13:46,548 [armeria-common-worker-epoll-2-13] WARN zipkin2.server.internal.BodyIsExceptionMessage (BodyIsExceptionMessage.java:41) - Unexpected error handling request.
    com.linecorp.armeria.common.ClosedSessionException: null
        at com.linecorp.armeria.common.ClosedSessionException.get(ClosedSessionException.java:36) ~[armeria-1.3.0.jar!/:?]
        at com.linecorp.armeria.server.HttpServerHandler.cleanup(HttpServerHandler.java:233) ~[armeria-1.3.0.jar!/:?]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [netty-transport-native-epoll-4.1.54.Final-linux-x86_64.jar!/:4.1.54.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Followed by these Kafka consumer messages:

    2021-05-26 08:21:45,809 [kafka-coordinator-heartbeat-thread | zipkin] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator (AbstractCoordinator.java:904) - [Consumer clientId=consumer-zipkin-12, groupId=zipkin] Group coordinator 192.168.30.72:9092 (id: 2147483647 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.

    2021-05-26 08:25:20,004 [kafka-coordinator-heartbeat-thread | zipkin] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator (AbstractCoordinator.java:1029) - [Consumer clientId=consumer-zipkin-5, groupId=zipkin] Member consumer-zipkin-5-d48b4ded-2cd5-40d0-a36b-9f3dc5d3555b sending LeaveGroup request to coordinator 192.168.30.72:9092 (id: 2147483647 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
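
To make the last log message concrete, here is a minimal sketch of the time budget implied by the two consumer settings in the start script. It assumes each poll() returns a full batch of max.poll.records and that the whole batch is processed before the next poll(); whether the Zipkin collector actually behaves this way is an assumption, and the function name `per_record_budget_ms` is hypothetical.

```python
def per_record_budget_ms(max_poll_interval_ms: int, max_poll_records: int) -> float:
    """Average time one record may take before a full batch exceeds max.poll.interval.ms.

    Assumption: a full batch of max_poll_records is handled between two poll() calls.
    """
    if max_poll_records <= 0:
        raise ValueError("max_poll_records must be positive")
    return max_poll_interval_ms / max_poll_records


# Values taken from the start script: 300000 ms interval, 500 records per poll.
current = per_record_budget_ms(300_000, 500)
print(f"current budget: {current:.0f} ms per record")

# Halving max.poll.records doubles the per-record budget for the same interval.
halved = per_record_budget_ms(300_000, 250)
print(f"with 250 records: {halved:.0f} ms per record")
```

Under these assumptions the consumer leaves the group once a batch of 500 records takes longer than 300 seconds in total, that is, more than 600 ms per record on average. This is consistent with storage writes stalling while the JVM is busy collecting garbage.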

haixinke commented 3 years ago

Environment: kafka_2.12-2.8.0, elasticsearch 7.10.2, CentOS 7, java version "1.8.0_181"

jcchavezs commented 3 years ago

Ping @jeqo

openzipkin / zipkin

zipkin-server not work with kafka and elasticsearch (over capacity) #3355
