openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

zipkin-server not work with kafka and elasticsearch (over capacity) #3355

Open haixinke opened 3 years ago

haixinke commented 3 years ago

Describe the Bug

We run zipkin-server 2.23.2 reading from Kafka with zipkin-collector-kafka, and the storage is Elasticsearch. With a high number of spans, zipkin-server stops working: the old generation of the JVM heap is at 100% and CPU usage is very high, then the server stops responding and UI queries fail too. My server start script is:

java -Dzipkin.collector.kafka.bootstrap-servers=192.168.30.72:9092 \
  -Dzipkin.collector.kafka.topic=zipkin \
  -Dzipkin.collector.kafka.groupId=zipkin \
  -Dzipkin.collector.kafka.overrides.max.poll.interval.ms=300000 \
  -Dzipkin.collector.kafka.overrides.max.poll.records=500 \
  -Dzipkin.collector.kafka.overrides.auto.offset.reset=latest \
  -Dzipkin.collector.kafka.streams=16 \
  -Dzipkin.storage.type=elasticsearch \
  -Dzipkin.storage.elasticsearch.hosts=192.168.30.72:19200 \
  -Dzipkin.storage.elasticsearch.username=elastic \
  -Dzipkin.storage.elasticsearch.password=123456 \
  -jar zipkin-server-2.23.2.jar


In the logs we get the exception:

2021-05-26 08:13:46,548 [armeria-common-worker-epoll-2-13] WARN zipkin2.server.internal.BodyIsExceptionMessage (BodyIsExceptionMessage.java:41) - Unexpected error handling request.
com.linecorp.armeria.common.ClosedSessionException: null
    at com.linecorp.armeria.common.ClosedSessionException.get(ClosedSessionException.java:36) ~[armeria-1.3.0.jar!/:?]
    at com.linecorp.armeria.server.HttpServerHandler.cleanup(HttpServerHandler.java:233) ~[armeria-1.3.0.jar!/:?]
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [netty-transport-native-epoll-4.1.54.Final-linux-x86_64.jar!/:4.1.54.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.54.Final.jar!/:4.1.54.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

2021-05-26 08:21:45,809 [kafka-coordinator-heartbeat-thread | zipkin] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator (AbstractCoordinator.java:904) - [Consumer clientId=consumer-zipkin-12, groupId=zipkin] Group coordinator 192.168.30.72:9092 (id: 2147483647 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.

2021-05-26 08:25:20,004 [kafka-coordinator-heartbeat-thread | zipkin] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator (AbstractCoordinator.java:1029) - [Consumer clientId=consumer-zipkin-5, groupId=zipkin] Member consumer-zipkin-5-d48b4ded-2cd5-40d0-a36b-9f3dc5d3555b sending LeaveGroup request to coordinator 192.168.30.72:9092 (id: 2147483647 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
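
To make the trade-off in that last Kafka message concrete, here is a minimal sketch (not Zipkin or Kafka client code; the class and method names are hypothetical) that computes how long each record in a full batch may take on average before the consumer exceeds max.poll.interval.ms. It uses the two override values from the start script and assumes, as a simplification, that the records of a batch are handled one after another, which may not match how the collector actually writes to storage.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of Zipkin or the Kafka client. */
public class PollBudget {

    /** Average time in ms one record may take before a full batch exceeds the poll interval. */
    static long perRecordBudgetMs(long maxPollIntervalMs, int maxPollRecords) {
        if (maxPollRecords <= 0) {
            throw new IllegalArgumentException("maxPollRecords must be positive");
        }
        return maxPollIntervalMs / maxPollRecords;
    }

    public static void main(String[] args) {
        // Values taken from the -Dzipkin.collector.kafka.overrides.* flags in the report.
        Properties overrides = new Properties();
        overrides.setProperty("max.poll.interval.ms", "300000");
        overrides.setProperty("max.poll.records", "500");

        long intervalMs = Long.parseLong(overrides.getProperty("max.poll.interval.ms"));
        int records = Integer.parseInt(overrides.getProperty("max.poll.records"));

        // 300000 ms / 500 records = 600 ms per record on average.
        System.out.println("per-record budget (ms): " + perRecordBudgetMs(intervalMs, records));
    }
}
```

Under that assumption, a full batch of 500 records has to be processed within 300 s, roughly 600 ms per record. If storage writes back up beyond that, the consumer leaves the group, which is what the LeaveGroup log line reports. Lowering max.poll.records or raising max.poll.interval.ms widens the budget, as the message itself suggests.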

haixinke commented 3 years ago

Environment: kafka_2.12-2.8.0, elasticsearch 7.10.2, CentOS 7, java version "1.8.0_181"

jcchavezs commented 3 years ago

Ping @jeqo