loki4j / loki-logback-appender

Fast and lightweight implementation of Logback appender for Grafana Loki
https://loki4j.github.io/loki-logback-appender/
BSD 2-Clause "Simplified" License

DefaultShutdownHook does not shutdown appender #175

Closed. sstamm closed this issue 1 year ago.

sstamm commented 1 year ago

Randomly, the JVM does not terminate, with the following stack trace. Using 1.4.0-rc2. I only see the problem on Windows Server 2012 R2; the same app on Linux does not show it. Also, with just the Loki appender removed (but still using Logback with a RollingFileAppender), no hang occurred.

* Thread group "main":

  Thread "HttpClient-1-SelectorManager":
    at sun.nio.ch.WEPoll.wait(long, long, int, int)
    at sun.nio.ch.WEPollSelectorImpl.doSelect(java.util.function.Consumer, long)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.util.function.Consumer, long)
    at sun.nio.ch.SelectorImpl.select(long)
    at jdk.internal.net.http.HttpClientImpl$SelectorManager.run()

  Thread "Logback shutdown hook [default]":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(long)
    at com.github.loki4j.client.pipeline.DefaultPipeline.waitSendQueueLessThan(int, long) (line: 341)
    at com.github.loki4j.client.pipeline.DefaultPipeline.waitSendQueueIsEmpty(long) (line: 334)
    at com.github.loki4j.client.pipeline.DefaultPipeline.stop() (line: 137)
    at com.github.loki4j.logback.Loki4jAppender.stop() (line: 167)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.detachAndStopAllAppenders() (line: 107)
    at ch.qos.logback.classic.Logger.detachAndStopAllAppenders() (line: 209)
    at ch.qos.logback.classic.Logger.recursiveReset() (line: 333)
    at ch.qos.logback.classic.LoggerContext.reset() (line: 352)
    at ch.qos.logback.classic.LoggerContext.stop() (line: 332)
    at ch.qos.logback.core.hook.ShutdownHookBase.stop() (line: 40)
    at ch.qos.logback.core.hook.DefaultShutdownHook.run() (line: 62)
    at java.lang.Thread.run()

  Thread "loki4j-encoder-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long)
    at com.github.loki4j.client.pipeline.DefaultPipeline.encodeStep(com.github.loki4j.client.batch.LogRecordBatch) (line: 200)
    at com.github.loki4j.client.pipeline.DefaultPipeline.runEncodeLoop() (line: 181)
    at com.github.loki4j.client.pipeline.DefaultPipeline.lambda$start$3() (line: 119)
    at com.github.loki4j.client.pipeline.DefaultPipeline$$Lambda$208.0x0000000800d45cc8.run()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()

  Thread "loki4j-scheduler-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.park()
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block()
    at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
    at java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take()
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take()
    at java.util.concurrent.ThreadPoolExecutor.getTask()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()

  Thread "loki4j-sender-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long)
    at com.github.loki4j.client.pipeline.DefaultPipeline.sendStep() (line: 254)
    at com.github.loki4j.client.pipeline.DefaultPipeline.runSendLoop() (line: 191)
    at com.github.loki4j.client.pipeline.DefaultPipeline.lambda$start$2() (line: 116)
    at com.github.loki4j.client.pipeline.DefaultPipeline$$Lambda$207.0x0000000800d45aa8.run()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()
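For context, the "Logback shutdown hook [default]" thread in this dump is Logback's DefaultShutdownHook: on JVM exit it stops the LoggerContext, which in turn calls Loki4jAppender.stop(), and the hang is the subsequent wait in waitSendQueueIsEmpty. In recent Logback versions this hook is registered via the <shutdownHook/> element; a minimal sketch, not taken from the reporter's actual configuration:

<configuration>
    <!-- registers ch.qos.logback.core.hook.DefaultShutdownHook; on JVM shutdown
         it calls LoggerContext.stop(), which stops every attached appender -->
    <shutdownHook/>
    <!-- appenders and loggers omitted from this sketch -->
</configuration>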
nehaev commented 1 year ago

Hi @sstamm, thanks for reporting this!

It would be helpful if you could provide a trace log for this, as described in https://loki4j.github.io/loki-logback-appender/docs/tracing#enabling-tracing-mode (you would need to upgrade to v1.4.0).

As a workaround, could you please try to set drainOnStop to false, e.g.:

<appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <drainOnStop>false</drainOnStop>
    ...
</appender>
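For reference, a fuller version of this workaround configuration might look like the sketch below; the URL and patterns are placeholders in the style of the Loki4j quick-start docs, not values from this thread. With drainOnStop set to false, stop() no longer waits for the send queue to drain, so the shutdown hook cannot block in waitSendQueueIsEmpty; the trade-off is that events still buffered at shutdown may be lost.

<appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <!-- workaround: do not wait for the send queue to drain in stop() -->
    <drainOnStop>false</drainOnStop>
    <http>
        <!-- placeholder Loki endpoint -->
        <url>http://localhost:3100/loki/api/v1/push</url>
    </http>
    <format>
        <label>
            <pattern>app=my-app,host=${HOSTNAME}</pattern>
        </label>
        <message>
            <pattern>l=%level c=%logger{20} t=%thread | %msg %ex</pattern>
        </message>
    </format>
</appender>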
sstamm commented 1 year ago

Is it possible to do this with Logback configuration, like <logger name="AsyncBufferPipeline" level="TRACE" additivity="false">? It happens randomly across distributed workers, so I would then have to specify a trace target per worker.

nehaev commented 1 year ago

This should be addressed in v1.4.1. Feel free to re-open in case the issue persists.

sstamm commented 1 year ago

Still exists with 1.4.1:

Thread dump at 0:03.689.780

* Thread group "main":

  Thread "HttpClient-1-SelectorManager":
    at sun.nio.ch.WEPoll.wait(long, long, int, int)
    at sun.nio.ch.WEPollSelectorImpl.doSelect(java.util.function.Consumer, long)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.util.function.Consumer, long)
    at sun.nio.ch.SelectorImpl.select(long)
    at jdk.internal.net.http.HttpClientImpl$SelectorManager.run()

  Thread "Logback shutdown hook [default]":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(long)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.waitSendQueueLessThan(int, long) (line: 370)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.waitSendQueueIsEmpty(long) (line: 164)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.stop() (line: 144)
    at com.github.loki4j.logback.Loki4jAppender.stop() (line: 167)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.detachAndStopAllAppenders() (line: 107)
    at ch.qos.logback.classic.Logger.detachAndStopAllAppenders() (line: 209)
    at ch.qos.logback.classic.Logger.recursiveReset() (line: 333)
    at ch.qos.logback.classic.LoggerContext.reset() (line: 362)
    at ch.qos.logback.classic.LoggerContext.stop() (line: 342)
    at ch.qos.logback.core.hook.ShutdownHookBase.stop() (line: 40)
    at ch.qos.logback.core.hook.DefaultShutdownHook.run() (line: 62)
    at java.lang.Thread.run()

  Thread "loki4j-encoder-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.encodeStep(com.github.loki4j.client.batch.LogRecordBatch) (line: 214)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.runEncodeLoop() (line: 195)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.lambda$start$4() (line: 122)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline$$Lambda$205.0x0000000800d451d8.run()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()

  Thread "loki4j-scheduler-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.park()
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block()
    at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
    at java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await()
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take()
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take()
    at java.util.concurrent.ThreadPoolExecutor.getTask()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()

  Thread "loki4j-sender-0":
    at jdk.internal.misc.Unsafe.park(boolean, long)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.sendStep() (line: 269)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.runSendLoop() (line: 205)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline.lambda$start$3() (line: 119)
    at com.github.loki4j.client.pipeline.AsyncBufferPipeline$$Lambda$204.0x0000000800d44fb8.run()
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run()
    at java.lang.Thread.run()
nehaev commented 1 year ago

@sstamm Ok, let's continue digging into this.

  1. Have you tried setting drainOnStop to false, as I suggested earlier?
  2. If setting drainOnStop to false helps, could you please enable metrics (https://loki4j.github.io/loki-logback-appender/docs/performance), set drainOnStop back to true, and check whether the loki4j.unsent.events metric is 0 for a hanging process? (See the sketch after this list.)
  3. If setting drainOnStop to false does not help, could you please elaborate on your Linux vs. Windows usage? You run the app on Linux as frequently as on Windows and see no issue on Linux, correct?
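A configuration sketch for step 2, assuming the metricsEnabled setting described in the linked performance docs (Loki4j publishes its metrics via Micrometer, so micrometer-core needs to be on the classpath):

<appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <!-- assumption based on the performance docs: enables Micrometer metrics,
         including loki4j.unsent.events -->
    <metricsEnabled>true</metricsEnabled>
    <!-- restore the default draining behavior for this test -->
    <drainOnStop>true</drainOnStop>
    <http>
        <url>http://localhost:3100/loki/api/v1/push</url>
    </http>
    <!-- format section as in the earlier sketch -->
</appender>

If a hanging process then reports loki4j.unsent.events greater than 0, the drain wait in stop() is presumably blocked on events that will never be sent.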
nehaev commented 1 year ago

Ready to re-open this issue if any details on how to reproduce it are provided.