oolong-dev / OpenTelemetry.jl

An unofficial implementation of OpenTelemetry in Julia.
https://oolong.dev/OpenTelemetry.jl/
Apache License 2.0

Memory leak when using the `/metrics` endpoint (PrometheusExporter) #95

Closed: krynju closed this issue 4 months ago

krynju commented 1 year ago

EDIT: This is observable on Julia 1.9, but not on 1.10.

I've identified a memory leak coming from OpenTelemetry in a long-running service.

I'm attaching an MWE split off from the main service, which reproduces the issue identically. The attached example is pretty extreme and can get you to ~2 GB of memory usage in about 10 minutes.

The leak depends on measurement activity and on capturing measurements through the `/metrics` endpoint.

  1. metricsspam emulates a running service, generating some metric activity (see the sketch after this list)
  2. to emulate metrics capture, we use ab (Apache Bench) to call the `/metrics` endpoint extensively
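run.jl itself ships in the zip; the metric activity it generates boils down to counter updates, roughly of this shape (a sketch assuming the Meter/Counter API from OpenTelemetryAPI; the names and attributes are made up):

```julia
using OpenTelemetryAPI

const meter    = Meter("memleak_demo")
const requests = Counter{Int}("requests_total", meter)

# Hypothetical spam loop: keep bumping the counter to mimic
# instrument traffic in a busy service.
function metricsspam()
    while true
        requests(1; route = "/api", status = "200")
        sleep(0.001)
    end
end
```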

Version info: Julia 1.9.3, Linux. OpenTelemetry versions as in the attached Manifest.toml, or more simply, this repo at commit https://github.com/oolong-dev/OpenTelemetry.jl/commit/4975ecddf521f4fa930d4bc8964c0b8a91301039 (a Pkg sketch for pinning these subpackages follows the list):

subdir="src/api",                      # 0.3.0
subdir="src/sdk",                      # 0.2.1
subdir="src/proto",                    # 0.13.1
subdir="src/exporter/otlp/proto/grpc", # 0.2.1
subdir="src/exporter/prometheus",      # 0.1.4
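These subpackages can be added at that exact commit with Pkg, along these lines (a sketch; the loop and variable names are mine, not from make.jl):

```julia
using Pkg

repo = "https://github.com/oolong-dev/OpenTelemetry.jl"
rev  = "4975ecddf521f4fa930d4bc8964c0b8a91301039"

# Add each subpackage from the monorepo pinned to the same revision.
for subdir in ("src/api", "src/sdk", "src/proto",
               "src/exporter/otlp/proto/grpc", "src/exporter/prometheus")
    Pkg.add(PackageSpec(url = repo, rev = rev, subdir = subdir))
end
```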

Reproducer: memleak.zip

Archive contents:

  1. run.jl - reproducer code
  2. make.jl - code to generate project/manifest with exact package revisions used
  3. Project.toml/Manifest.toml

Steps:

  1. extract the zip
  2. run the reproducer using julia --project=. -e "include(\"run.jl\");init();"
  3. run the metrics capture emulation using ab -c 500 -n 1000000 http://localhost:9967/
  4. observe endlessly rising memory usage in the process monitoring tool of your choice (or from within Julia, as sketched below)
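Memory growth can also be watched from inside the Julia process itself; a minimal sketch using Sys.maxrss (a high-water mark, so it only ever rises):

```julia
# Print the process's peak resident set size every few seconds.
# On a leaking run this climbs steadily into the gigabytes.
@async while true
    println("maxrss: ", round(Sys.maxrss() / 2^20; digits = 1), " MiB")
    sleep(5)
end
```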

Example run: ~4 GB in 25 min (screenshot of memory usage)

lamdor commented 10 months ago

@krynju I think this may be an issue of the incremental GC not being able to keep up. I tried to recreate this to see if I could fix it; what I ended up trying was having a background task do a full GC every few seconds. With that, I don't see the memory continue to grow, even after an hour. Before the GC.gc() change, the memory would grow by about 150 MB/minute.

Here's my diff:

4c4
< using OpenTelemetryExporterPrometheus
---
> using OpenTelemetryExporterPrometheus
7a8,9
> GC.enable_logging(true)
>
172a175,182
> function gc_full_occasionally()
>     while true
>         sleep(2)
>         GC.gc()
>     end
>
> end
>
198c208,217
<
---
>         @async begin
>             try
>                 gc_full_occasionally()
>             catch ex
>                 @error(
>                     "exception initializing",
>                     exception = (ex, catch_backtrace())
>                 )
>             end
>         end
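Untangled from the diff, the workaround is essentially this (a cleaned-up sketch of the same background task):

```julia
# Force a full collection on a fixed cadence; incremental sweeps alone
# were not returning memory fast enough under /metrics load.
function gc_full_occasionally(period_seconds = 2)
    while true
        sleep(period_seconds)
        GC.gc()  # GC.gc() defaults to a full collection
    end
end

# Run it for the lifetime of the service, logging instead of dying silently.
@async try
    gc_full_occasionally()
catch ex
    @error "background GC task failed" exception = (ex, catch_backtrace())
end
```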
krynju commented 10 months ago

Notes from today:

  1. GC in a loop with a 2 s period: ~540 MB after 2 h
  2. GC called manually from an interactive REPL, occasionally during the run and extensively at the end: ~7.5 GB after 2 h
  3. GC in a loop with a 60 s period: ~645 MB after ~30 min
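The individual collections behind these numbers can be watched directly with GC logging, which lamdor's diff also enables:

```julia
GC.enable_logging(true)  # log each collection's pause time and freed memory
```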
krynju commented 9 months ago

Some experiments with heap snapshots.

After the 4th snapshot I ran GC manually and:

  1. the heap actually had elements removed correctly (comparing the 4th and 5th snapshots shows a clear delta)
  2. the process memory footprint stayed the same
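For reference, snapshots like these can be captured from the running process with the Profile stdlib (Julia 1.9+); the exact invocation isn't shown in this issue, so treat it as an assumption:

```julia
using Profile

# Writes a .heapsnapshot file that the Chrome DevTools Memory tab can
# load, which is what makes comparing consecutive captures possible.
Profile.take_heap_snapshot("capture-4.heapsnapshot")
```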

I'm starting to think this is some hidden Julia issue. I would expect the process memory footprint to go down along with the cleanup of the heap.

(screenshot: heap snapshot comparison)

Heap dump says the heap is only ~150 MB, but process info says the process is using ~1.5 GB. (screenshot: process info)
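The same gap can be measured from inside the process; a sketch comparing the GC-tracked live heap against the OS-level footprint:

```julia
GC.gc()  # full collection first, so mostly live objects remain

# GC-tracked heap vs. peak resident set size reported by the OS.
println("live heap:   ", round(Base.gc_live_bytes() / 2^20; digits = 1), " MiB")
println("process RSS: ", round(Sys.maxrss() / 2^20; digits = 1), " MiB")
```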

krynju commented 9 months ago

I ran the reproducer on 1.10 on a custom branch of OpenTelemetry.jl that has proper 1.10 support. In two runs I could not reproduce the leak. Memory usage stays stable under load, and GC seems to clean up most of it after I run it manually.

This is potentially good news, but I'm not 100% sure of it yet. Will do more testing etc.

krynju commented 4 months ago

It's a Julia 1.9 issue; closing this.