loki4j / loki-logback-appender

Fast and lightweight implementation of Logback appender for Grafana Loki
https://loki4j.github.io/loki-logback-appender/
BSD 2-Clause "Simplified" License

High idle CPU usage #128

Closed zodo closed 2 years ago

zodo commented 2 years ago

Greetings,

First of all, thanks for the great library!

I've encountered high idle CPU usage. This is especially noticeable when running multiple microservices (~15 instances) on a single AWS EC2 instance.

After disabling loki4j and enabling promtail, the idle load returned to normal.

I'm not sure what could cause this behavior. Looking forward to your support.

nehaev commented 2 years ago

Hi @zodo! Thanks for your report!

> I've encountered high idle CPU usage.

Does this mean that CPU usage is high even when no log records are being written?

I'll try to reproduce this issue locally. If I don't succeed, could I ask you to collect a CPU profiling report? For example, you can use Async Profiler: it's simple and non-intrusive, and it should help find the root cause of this issue.

zodo commented 2 years ago

That's correct: CPU usage is high while no records are being written.

Here is the flame graph. I hope it is sufficient, because I can't attach the JFR file.

image

To make things clearer: high CPU usage doesn't mean that it consumes all the available resources. The CPU usage was slightly above baseline, but that was enough to make the EC2 instance start consuming CPU credits.

zodo commented 2 years ago

I found that I ran the profiler with version 1.2.0.

Here is the flame graph for version 1.3.0; the results are the same.

image

nehaev commented 2 years ago

Thanks for providing the CPU profiling report! The root cause of this issue is this park timeout, which is hardcoded to 1 ms. The timeout defines how often Loki4j checks its internal queues for new events to encode or send. It looks like 1 ms causes too much thread context switching when idle.
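
To illustrate the mechanism described above, here is a minimal sketch of a polling loop that parks between checks of an empty queue. It is not Loki4j's actual code: the class `IdlePollingDemo` and the method `countIdleWakeups` are hypothetical names, and the queue is deliberately never filled so that the loop stays idle. It only shows that a shorter park timeout makes an idle thread wake up far more often.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

class IdlePollingDemo {

    /**
     * Polls an always-empty queue for roughly durationMs, parking for
     * parkNanos after each empty check. Returns how many times the loop
     * found nothing and parked (i.e. the number of idle wakeups).
     */
    public static long countIdleWakeups(long parkNanos, long durationMs) {
        // Hypothetical stand-in for an appender's internal event queue.
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        long deadline = System.nanoTime() + durationMs * 1_000_000L;
        long wakeups = 0;
        while (System.nanoTime() < deadline) {
            String event = queue.poll();
            if (event == null) {
                // Nothing to encode or send: sleep until the timeout expires.
                LockSupport.parkNanos(parkNanos);
                wakeups++;
            }
        }
        return wakeups;
    }

    public static void main(String[] args) {
        long fast = countIdleWakeups(1_000_000L, 500);   // 1 ms park timeout
        long slow = countIdleWakeups(25_000_000L, 500);  // 25 ms park timeout
        System.out.println("1 ms park:  " + fast + " idle wakeups in 500 ms");
        System.out.println("25 ms park: " + slow + " idle wakeups in 500 ms");
    }
}
```

The exact counts depend on the OS timer resolution and scheduler, so treat the printed numbers as indicative only.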

There's a tradeoff between encode/send latency and CPU usage, but I think it should be safe to increase the value to 25 ms as a temporary fix. In future versions, I'll consider making this a config property.
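
The tradeoff can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes one polling thread per service instance (an assumption, not something confirmed in this thread) and uses the "~15 instances" figure from the original report. The class and method names are hypothetical.

```java
class ParkTimeoutTradeoff {

    /** Approximate idle wakeups per second for one polling thread (integer division). */
    public static long idleWakeupsPerSecond(long parkTimeoutMs) {
        return 1000 / parkTimeoutMs;
    }

    /** Worst-case extra delay before a newly queued event is noticed. */
    public static long worstCaseAddedLatencyMs(long parkTimeoutMs) {
        return parkTimeoutMs;
    }

    public static void main(String[] args) {
        int instances = 15; // "~15 instances" from the report; one polling thread each is assumed
        for (long timeoutMs : new long[] {1, 25}) {
            long perThread = idleWakeupsPerSecond(timeoutMs);
            System.out.println(timeoutMs + " ms park: "
                + perThread + " wakeups/s per thread, "
                + perThread * instances + " wakeups/s across " + instances + " instances, "
                + "up to " + worstCaseAddedLatencyMs(timeoutMs) + " ms added latency");
        }
    }
}
```

Under these assumptions, going from 1 ms to 25 ms cuts idle wakeups from 1000 to 40 per second per thread, at the cost of up to 25 ms of extra delay before a queued event is picked up.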

nehaev commented 2 years ago

Loki4j v1.3.1 has been published. According to my measurements, idle CPU usage should be lower. Feel free to re-open this issue if the fix doesn't work.