open-telemetry / opentelemetry-go-contrib

Memory leak in Otel Http Handler caused by Counter and Histogram #2413

Closed joshuatroy closed 2 years ago

joshuatroy commented 2 years ago

Hi OpenTelemetry team 👋

We are using the go.opentelemetry.io/otel and go.opentelemetry.io/contrib packages at my work. They are part of our core library, which manages the lifecycle of our services (e.g. it simplifies how we register HTTP/gRPC endpoints, set up RabbitMQ/Kafka consumers, handle liveness/readiness probes, and register and collect metrics).

This library is used by the majority of our Go microservices. In May, we upgraded the OpenTelemetry packages in the core library:

go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.22.0 -> v0.31.0
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.22.0 -> v0.31.0
go.opentelemetry.io/contrib/propagators v0.22.0
go.opentelemetry.io/otel v1.5.0 -> v1.6.3
go.opentelemetry.io/otel/exporters/jaeger v1.5.0 -> v1.6.3
go.opentelemetry.io/otel/sdk v1.5.0 -> v1.6.3
go.opentelemetry.io/otel/trace v1.5.0 -> v1.6.3

We then released a new tagged version of our core library. As services started adopting it, we began to notice memory leaks. The metrics driving the graph below come from Kubernetes, specifically from a service's container, and we're seeing a similar leak across all services that have upgraded to the new core library (which uses the go.opentelemetry.io/otel and go.opentelemetry.io/contrib packages). kubernetes-memory

So we used pprof to take a look at the service's heap and generated the SVG below: service-example
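For reference, a minimal sketch of how a heap profile like this can be exposed from a running Go service using the standard library's net/http/pprof package (the port is arbitrary and our actual setup may differ):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a separate port so they bypass the
	// instrumented application handlers.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the actual service here; block so the sketch keeps running.
	select {}
}
```

The heap profile can then be fetched and rendered with `go tool pprof -svg http://localhost:6060/debug/pprof/heap`, or explored interactively as in the top20 listings below.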

We can see that go.opentelemetry.io/otel/metric/internal/global.(*siInstProvider).Counter accounts for the largest flat and cumulative memory. otel-leak

We can also confirm this using the pprof CLI (I've replaced our company Git domain and service names with example names).

(pprof) top20 -cum
Showing nodes accounting for 171.62MB, 83.40% of 205.78MB total
Dropped 192 nodes (cum <= 1.03MB)
Showing top 20 nodes out of 80
      flat  flat%   sum%        cum   cum%
         0     0%     0%   177.70MB 86.35%  net/http.HandlerFunc.ServeHTTP
         0     0%     0%   177.01MB 86.02%  gopkg.in/DataDog/dd-trace-go.v1/contrib/gorilla/mux.(*Router).ServeHTTP
         0     0%     0%   177.01MB 86.02%  gopkg.in/DataDog/dd-trace-go.v1/contrib/net/http.TraceAndServe
         0     0%     0%   177.01MB 86.02%  net/http.(*conn).serve
         0     0%     0%   177.01MB 86.02%  net/http.serverHandler.ServeHTTP
         0     0%     0%   176.67MB 85.85%  github.com/gorilla/mux.(*Router).ServeHTTP
         0     0%     0%   176.20MB 85.62%  example.company.domain/go/core-library/listeners/httplistener.ContextMiddleware.func1.1
         0     0%     0%   176.20MB 85.62%  example.company.domain/go/core-library/listeners/httplistener.OtelMiddleware.func1
         0     0%     0%   175.70MB 85.38%  example.company.domain/go/core-library/listeners/httplistener.APIMiddleware.func1.1
         0     0%     0%   171.62MB 83.40%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*Handler).createMeasures
         0     0%     0%   171.62MB 83.40%  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewHandler
  127.12MB 61.77% 61.77%   127.12MB 61.77%  go.opentelemetry.io/otel/metric/internal/global.(*siInstProvider).Counter
   44.50MB 21.63% 83.40%    44.50MB 21.63%  go.opentelemetry.io/otel/metric/internal/global.(*sfInstProvider).Histogram
         0     0% 83.40%    32.20MB 15.65%  example.company.domain/go/example-service/internal/transport/http.(*handler).loggingMiddleware.func1
         0     0% 83.40%    31.20MB 15.16%  example.company.domain/go/example-service/internal/transport/http.(*handler).securityMiddleware.func1
         0     0% 83.40%     9.77MB  4.75%  runtime.main
         0     0% 83.40%     7.18MB  3.49%  gopkg.in/launchdarkly/go-server-sdk-evaluation.v1/ldmodel.UnmarshalSegmentFromJSONReader (inline)
         0     0% 83.40%     7.18MB  3.49%  gopkg.in/launchdarkly/go-server-sdk-evaluation.v1/ldmodel.unmarshalSegmentFromReader
         0     0% 83.40%     7.18MB  3.49%  gopkg.in/launchdarkly/go-server-sdk.v5/internal/datakinds.segmentStoreDataKind.DeserializeFromJSONReader
         0     0% 83.40%     7.18MB  3.49%  gopkg.in/launchdarkly/go-server-sdk.v5/internal/datasource.(*StreamProcessor).consumeStream

(pprof) top20 -flat
Active filters:
   ignore=flat
Showing nodes accounting for 188.40MB, 91.55% of 205.78MB total
Dropped 194 nodes (cum <= 1.03MB)
Showing top 20 nodes out of 62
      flat  flat%   sum%        cum   cum%
  127.12MB 61.77% 61.77%   127.12MB 61.77%  go.opentelemetry.io/otel/metric/internal/global.(*siInstProvider).Counter
   44.50MB 21.63% 83.40%    44.50MB 21.63%  go.opentelemetry.io/otel/metric/internal/global.(*sfInstProvider).Histogram
    5.01MB  2.43% 85.83%     5.01MB  2.43%  github.com/DataDog/datadog-go/v5/statsd.newStatsdBuffer (inline)
       4MB  1.95% 87.78%        4MB  1.95%  runtime.allocm
    3.97MB  1.93% 89.71%     3.97MB  1.93%  gopkg.in/launchdarkly/go-server-sdk-evaluation.v1/ldmodel.PreprocessSegment
    2.50MB  1.21% 90.92%     2.50MB  1.21%  gopkg.in/launchdarkly/go-jsonstream.v1/jreader.(*tokenReader).String (inline)
    0.71MB  0.35% 91.27%     3.21MB  1.56%  gopkg.in/launchdarkly/go-server-sdk-evaluation.v1/ldmodel.readStringList
    0.59MB  0.29% 91.55%     1.09MB  0.53%  github.com/go-playground/validator/v10.init
         0     0% 91.55%     5.55MB  2.70%  example.company.domain/go/core-library.(*Service).run
         0     0% 91.55%     1.08MB  0.53%  example.company.domain/go/core-library.(*Service).setupListeners.func1
         0     0% 91.55%     1.08MB  0.53%  example.company.domain/go/core-library.(*Service).setupListeners.func1.1
         0     0% 91.55%     5.55MB  2.70%  example.company.domain/go/core-library.Run
         0     0% 91.55%   173.12MB 84.13%  example.company.domain/go/core-library/listeners/httplistener.APIMiddleware.func1.1
         0     0% 91.55%   173.62MB 84.37%  example.company.domain/go/core-library/listeners/httplistener.ContextMiddleware.func1.1
         0     0% 91.55%   173.62MB 84.37%  example.company.domain/go/core-library/listeners/httplistener.OtelMiddleware.func1
         0     0% 91.55%    30.51MB 14.82%  example.company.domain/go/example-service/internal/transport/http.(*handler).loggingMiddleware.func1
         0     0% 91.55%    29.50MB 14.34%  example.company.domain/go/example-service/internal/transport/http.(*handler).securityMiddleware.func1
         0     0% 91.55%     5.01MB  2.43%  github.com/DataDog/datadog-go/v5/statsd.(*bufferPool).addNewBuffer (inline)
         0     0% 91.55%     5.01MB  2.43%  github.com/DataDog/datadog-go/v5/statsd.New
         0     0% 91.55%     5.01MB  2.43%  github.com/DataDog/datadog-go/v5/statsd.newBufferPool (inline)

So the createMeasures function call led me here, but as I'm creating this issue I'm wondering whether it should be raised against this repository or go.opentelemetry.io/otel/metric.

We have also tried upgrading to later versions of these libraries, but the issue is still there.

go.opentelemetry.io/otel v1.7.0
go.opentelemetry.io/contrib v1.7.0

What version of Go and opentelemetry-go-contrib are you using? Go 1.17, opentelemetry-go-contrib v1.7.0

What operating system and processor architecture are you using? OS: Linux Architecture: x86_64

What did you do? Upgraded go.opentelemetry.io libraries

What did you expect to see? No change in memory usage

What did you see instead? Memory leak

fatsheep9146 commented 2 years ago

Could you show the code where you use the otelhttp instrumentation library? I tried to use the example code in

https://github.com/open-telemetry/opentelemetry-go-contrib/tree/instrumentation/net/http/otelhttp/v0.31.0/instrumentation/net/http/otelhttp/example

to start a simple server and client and send 10000 mock requests to the server; the memory usage did not grow nearly as fast as in your case.

Also, the largest memory consumers were not global.(*siInstProvider).Counter and global.(*sfInstProvider).Histogram.
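For reference, this is roughly the kind of load I generated (a minimal sketch; it assumes the example server from the link above is listening on localhost:7777 and serving /hello, which may differ from the actual example):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Send a fixed number of requests at the instrumented example server and
	// watch the heap (e.g. via pprof) while the load runs.
	const requests = 10000
	for i := 0; i < requests; i++ {
		resp, err := http.Get("http://localhost:7777/hello") // assumed address/path of the example server
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```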

joshuatroy commented 2 years ago

Hey @fatsheep9146 sorry for my delay in responding.

The problem was in fact in how we were using the library: a new otelhttp handler was being created on every request inside some middleware, instead of being created once and reused!
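For anyone who runs into the same symptom, the difference looks roughly like this (a hedged sketch; OtelMiddleware and the surrounding wiring are reconstructed from the middleware names in the pprof output above, not our actual core-library code):

```go
package httplistener

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// Leaky pattern: otelhttp.NewHandler runs inside the per-request closure, so a
// new handler (and a fresh set of counters/histograms via createMeasures) is
// allocated on every single request.
func LeakyOtelMiddleware(operation string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			otelhttp.NewHandler(next, operation).ServeHTTP(w, r) // BUG: new instruments per request
		})
	}
}

// Fixed pattern: the otelhttp handler is built once when the middleware chain
// is assembled and then reused for every request.
func OtelMiddleware(operation string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return otelhttp.NewHandler(next, operation)
	}
}
```

With gorilla/mux the fixed version can be registered a single time when the router is built (e.g. router.Use(OtelMiddleware("http.server"))), so createMeasures runs once at startup instead of on every request.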