hrxi / tracing-loki

A tracing layer for shipping logs to Grafana Loki
Apache License 2.0
50 stars 22 forks

RAM usage #32

Open gabrik opened 11 hours ago

gabrik commented 11 hours ago

Hi, first of all, thank you for this crate.

I've noticed a memory usage issue when pushing logs to Loki using this crate.

I can see that RAM usage keeps growing until the process gets OOM-killed.

[Screenshot 2024-12-04 at 11:32:54: RAM usage growing over time]

whereas without it, RAM usage stays stable:

[Screenshot 2024-12-04 at 11:33:35: stable RAM usage without the crate]

I can see the same behaviour using the Instruments application on macOS:

[Screenshot 2024-12-04 at 11:34:19: Instruments allocations trace]

This is how I'm configuring it:

match (
    get_loki_endpoint(),
    get_loki_apikey(),
    get_loki_apikey_header(),
) {
    (Some(loki_url), Some(apikey), Some(header)) => {
        let builder = tracing_loki::builder();

        // Add every string-valued entry of `metadata` as a Loki label.
        let builder = match metadata.as_object() {
            Some(metadata) => metadata
                .into_iter()
                .try_fold(builder, |builder, (k, v)| match v.as_str() {
                    Some(v) => builder.label(k, v),
                    None => Ok(builder),
                }),
            None => Ok(builder),
        }?;

        // Build the layer and its background task, then install the layer
        // and spawn the task that actually ships logs to Loki.
        let (loki_layer, task) = builder
            .http_header(header, apikey)?
            .build_url(Url::parse(&loki_url)?)?;

        tracing_sub.with(loki_layer).init();
        tokio::spawn(task);
        return Ok(());
    }
    _ => {
        tracing::warn!("Missing one of the required settings for Loki!")
    }
};
Any suggestions?

If you need the Allocations trace file, I can share it.
gabrik commented 10 hours ago

Maybe it comes from this line: https://github.com/hrxi/tracing-loki/blob/master/src/lib.rs#L513 — this queue seems to be just a couple of Vec<LokiEvent>, and it only ever seems to grow in size.

I see that here https://github.com/hrxi/tracing-loki/blob/master/src/lib.rs#L542 it should drop something, but given the memory usage that does not seem to happen.
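
For illustration, here is a hypothetical, simplified sketch of the double-buffered queue pattern those lines appear to describe (names invented, not the crate's actual code). If requests fall behind the incoming log rate, the backlog Vec grows without bound:

struct LokiEvent; // placeholder for the crate's serialized event

struct SendQueue {
    queue: Vec<LokiEvent>,   // events buffered while a request is in flight
    sending: Vec<LokiEvent>, // events belonging to the in-flight request
}

impl SendQueue {
    // Called for every log event: grows without bound if requests cannot
    // keep up with the incoming rate.
    fn push(&mut self, event: LokiEvent) {
        self.queue.push(event);
    }

    // Called once the previous request has finished: drop the events that
    // were just sent and swap the backlog in as the next batch. Note that
    // Vec::clear keeps its allocation, so capacity never shrinks.
    fn start_next_batch(&mut self) -> &[LokiEvent] {
        self.sending.clear();
        std::mem::swap(&mut self.queue, &mut self.sending);
        &self.sending
    }
}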

hrxi commented 9 hours ago

Permalinks (press 'y' anywhere on GitHub):

https://github.com/hrxi/tracing-loki/blob/aa0f6dbbabbd189d61a9f6da12dcee74262a8ae2/src/lib.rs#L513

https://github.com/hrxi/tracing-loki/blob/aa0f6dbbabbd189d61a9f6da12dcee74262a8ae2/src/lib.rs#L542

Can you tell me a bit about your setup? How many log messages do you log per second? Do any logs make it to Loki?

gabrik commented 9 hours ago

Logs do reach Loki, and there I can see up to around 50 logs per second.

Loki is in the cloud; latency to reach the server is around 100 ms.

By the way, I was digging into the code and I don't understand why the background task does not send the events directly, instead of pushing them into two Vecs and starting a task per HTTP request.

It seems to batch some of those logs, but as a result it makes the whole architecture quite complex.

hrxi commented 8 hours ago

By the way, I was digging into the code and I don't understand why the background task does not send the events directly, instead of pushing them into two Vecs and starting a task per HTTP request.

It seems to batch some of those logs, but as a result it makes the whole architecture quite complex.

Yes, the idea is to have at most one outgoing HTTPS request at a time, and always push all available log messages when starting a new HTTPS request.
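
As an illustration of that strategy in isolation, here is a minimal, self-contained sketch (hypothetical names and a simulated HTTP send, not the crate's actual code): events accumulate while a request is in flight, and once it completes, everything available is shipped in a single new request.

use std::time::Duration;
use tokio::sync::mpsc;

async fn send_batch(batch: Vec<String>) {
    // Stand-in for the HTTP push to Loki.
    println!("sending {} events in one request", batch.len());
    tokio::time::sleep(Duration::from_millis(100)).await; // simulated latency
}

async fn background_task(mut receiver: mpsc::Receiver<String>) {
    let mut pending: Vec<String> = Vec::new();
    loop {
        // Wait for at least one event; events keep queuing in the channel
        // while the previous send is awaited below.
        match receiver.recv().await {
            Some(event) => pending.push(event),
            None => break, // all senders dropped
        }
        // Drain whatever else is already queued...
        while let Ok(event) = receiver.try_recv() {
            pending.push(event);
        }
        // ...and send it all in one request; nothing new is sent until
        // this request completes, so at most one request is outstanding.
        send_batch(std::mem::take(&mut pending)).await;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(512); // bounded channel caps buffered memory
    let task = tokio::spawn(background_task(rx));
    for i in 0..2000 {
        let _ = tx.send(format!("log line {i}")).await;
    }
    drop(tx);
    let _ = task.await;
}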

gabrik commented 8 hours ago

I tried changing the code to push the log line directly after receiver.next(), and RAM usage has decreased significantly.

From almost 100 MB in 15 minutes down to 15 MB in 15 minutes, and in general it seems very stable.

I'm wondering whether the BackgroundTask should run periodically, emptying the channel and sending everything at once. There would still be at most one outstanding request, containing all the events that were in the channel.

Of course, the period should be configurable.
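
For reference, the experiment described above roughly amounts to the following (hypothetical sketch, reusing the send_batch stand-in from the earlier example): send each event as soon as it is received, one request per event.

async fn background_task(mut receiver: tokio::sync::mpsc::Receiver<String>) {
    // One request per event: the task itself never accumulates a backlog,
    // so its memory stays flat, at the cost of many more HTTP requests.
    while let Some(event) = receiver.recv().await {
        send_batch(vec![event]).await;
    }
}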

gabrik commented 8 hours ago

Here is some pseudocode:

loop {
    sleep(period).await;
    // Take everything currently queued in the channel...
    let events = receiver.drain();
    // ...and ship it to Loki in a single HTTP request.
    let http_request = make_request(events);
    http_client.send(http_request).await;
}

Still at most one outstanding request, but memory would be bounded by the channel size (which should also be configurable).
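
A slightly more concrete sketch of that idea, assuming a bounded tokio::sync::mpsc channel (the receiver.drain() call above is pseudocode; a try_recv loop approximates it) and reusing the send_batch stand-in from the earlier sketch:

use std::time::Duration;
use tokio::sync::mpsc::{error::TryRecvError, Receiver};

async fn background_task(mut receiver: Receiver<String>, period: Duration) {
    let mut ticker = tokio::time::interval(period);
    loop {
        ticker.tick().await;
        // Drain everything currently queued in the bounded channel.
        let mut events = Vec::new();
        loop {
            match receiver.try_recv() {
                Ok(event) => events.push(event),
                Err(TryRecvError::Empty) => break,
                Err(TryRecvError::Disconnected) => {
                    // All senders are gone: flush what is left and exit.
                    if !events.is_empty() {
                        send_batch(events).await;
                    }
                    return;
                }
            }
        }
        // At most one outstanding request: the next drain happens only
        // after this send completes.
        if !events.is_empty() {
            send_batch(events).await;
        }
    }
}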

gabrik commented 7 hours ago

@hrxi I made the changes in #33; we can discuss them there. To me it seems like a nice improvement.