loki4j / loki-logback-appender

Fast and lightweight implementation of Logback appender for Grafana Loki
https://loki4j.github.io/loki-logback-appender/
BSD 2-Clause "Simplified" License
314 stars 27 forks source link

Memory issue when labels change quite often (on every log request) #222

Closed ZIRAKrezovic closed 8 months ago

ZIRAKrezovic commented 8 months ago

I have been debugging a memory issue and pinned it finally to loki-logback-encoder. Lets start with my analysis

My memory heap dump (relevant part only)

image

Loki appender related in: java.util.String

image

LogStream holds the most entries for HashMap usage, mainly the labels which are different on every send

image

Final holder is the UnboundAtomicMapCache instantiated in AbstractLoki4jEncoder.

image

Due to nature of trace id being virtually different in every log message, cache miss is guaranteed and creation of new node on every new log request

image

This cache will grow indefinitely and never free unused labels, resulting in application eventually dying with out of memory error with any heap limit and a certain amount of logs.

Finally, my logback-spring.xml snippet

    <springProfile name="logLoki">
        <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
            <http>
                <url>${lokiHost}</url>
                <tenantId>${lokiTenantId}</tenantId>
                <auth>
                    <username>${lokiUserName}</username>
                    <password>${lokiPassword}</password>
                </auth>
            </http>
            <format>
                <label>
                    <pattern>application=${appName},hostname=${HOSTNAME},traceId=%X{traceId:-NONE},level=%level,vendor=ZIRA</pattern>
                </label>
                <message>
                    <pattern>${FILE_LOG_PATTERN}</pattern>
                </message>
                <sortByTime>true</sortByTime>
            </format>
        </appender>

        <root level="INFO">
            <appender-ref ref="LOKI" />
        </root>
    </springProfile>

Now, I had the issue where loki would reject the logs due to label difference, so I had otel-exporter strip the traceId field and could probably drop it here. But still, the memory issue should be fixed or users should be aware of it, at least. That's why I am creating this (more of an FYI) report.

I am using loki-logback-appender 1.4.2 with spring-boot 3.2.3

nehaev commented 8 months ago

Hi @ZIRAKrezovic, thanks for reporting this! I guess you're right, even in extreme cases like this, the default cache implementation shouldn't lead to OOM. However, there is always a tradeoff as the current implementation is simple, performant, and covers all the sane use cases. I'll see what I can do to improve it.

As a workaround, you can check a setting format.label.streamCache that allows you to customize a cache implementation used for streams. You can either use advanced libs like Caffeine or omit caching completely.

Probably you know, but still, I have to notice, that your use case with too many unique streams goes strictly against Loki's best practices.

ZIRAKrezovic commented 8 months ago

Hi @nehaev, I am aware of my bad practice and have already fixed it. I still chose to report the bug in case somebody else runs into it. A simple(st) fix would be to WARN the user when cache reaches certain threshold (i.e. 100, 1000, ...).

nehaev commented 8 months ago

Some relevant experiments from the past: https://github.com/loki4j/loki-logback-appender/compare/main...limited-stream-cache