
Understanding high memory usage of otelcol (prometheusreceiver / prometheusexporter) #9998

Closed Mario-Hofstaetter closed 2 years ago

Mario-Hofstaetter commented 2 years ago

Describe the bug

I am infrequently observing high memory usage of otelcol.exe on Windows systems. The memory does stay within the bounds set by memory_limiter (see config below), but I am not sure whether this memory usage is intended behavior, since there is negligible load on the systems. I am not experienced in Go, so please excuse my lack of knowledge.

My planned actions to counteract this behavior are to reduce the configured limit_mib and to fix the memory_ballast configuration.

Steps to reproduce

Not sure how to reproduce; full config is below.

What did you expect to see?

otelcol should not consume much memory given the small workload, and/or memory should eventually be garbage collected.

What did you see instead?

The otelcol process uses up to 2 GB of memory. Currently on my local machine it sits at ~1.5 GB. Last stdout log messages from memory_limiter after the machine resumed from hibernation:

{"level":"info","ts":1652033885.9450645,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":197}
{"level":"info","ts":1652033885.8394392,"caller":"memorylimiterprocessor/memorylimiter.go:310","msg":"Memory usage is above soft limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1623}

Is this the intended behavior of memory usage of the process?

What are the recommended values for memory_limiter? The documentation is a little vague.

On my machine, the prometheus exporter currently emits ~19362 lines of metrics, less than 3 MB in size. On our biggest instance, the prometheus exporter emits 37654 metric lines, ~8 MB. Is this a lot?

The documentation on that page uses limit_mib: 4000, which seems huge for this kind of application.

After re-reading those docs just now, this line caught my attention:

limit_mib (default = 0): Maximum amount of memory, in MiB, targeted to be allocated by the process heap. Note that typically the total memory usage of process will be about 50MiB higher than this value. This defines the hard limit.

So is it actually the expected behavior of otelcol to stay around limit_mib indefinitely?
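
If that is the case, I will probably lower the hard limit and set an explicit soft limit via spike_limit_mib (if I read the docs correctly, the soft limit is limit_mib minus spike_limit_mib, with spike_limit_mib defaulting to 20% of limit_mib). A rough sketch of what I have in mind; the values are my own guesses, not recommendations from the docs:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500          # hard limit, instead of the current 2000
    spike_limit_mib: 128    # soft limit becomes limit_mib - spike_limit_mib = 372 MiB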

My trace queue size is currently at zero:

# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="jaeger",service_instance_id="fc4ff71e-1a39-4b13-9990-db1ec97e9c9b",service_version="latest"} 0

I had the suspicion that the trace buffer was filling memory while the VPN was disconnected and the jaeger server was not reachable, since the default queue size of 5000 (?) seemed rather high.

Looking at logs from last night, it was the traces pipeline that caused the memory increase:

2022-05-07T01:22:01+02:00   {"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:303","msg":"Memory usage back within limits. Resuming normal operation.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:22:01+02:00   {"level":"info","ts":1651879321.2295015,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":182}
2022-05-07T01:20:19+02:00   {"level":"info","ts":1651879219.4642153,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1665}
2022-05-07T01:18:37+02:00   {"level":"info","ts":1651879117.4491055,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1661}
2022-05-07T01:16:55+02:00   {"level":"info","ts":1651879015.4828515,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1664}
2022-05-07T01:15:30+02:00   {"level":"info","ts":1651878930.5143886,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:14:05+02:00   {"level":"info","ts":1651878845.4196262,"caller":"memorylimiterprocessor/memorylimiter.go:281","msg":"Memory usage after GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":1683}
2022-05-07T01:22:01+02:00   {"level":"warn","ts":1651879321.100273,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2059}
2022-05-07T01:20:19+02:00   {"level":"warn","ts":1651879219.098343,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2049}
2022-05-07T01:18:37+02:00   {"level":"warn","ts":1651879117.0985374,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2048}
2022-05-07T01:16:55+02:00   {"level":"warn","ts":1651879015.103772,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2012}
2022-05-07T01:15:30+02:00   {"level":"warn","ts":1651878930.103912,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2005}
2022-05-07T01:14:05+02:00   {"level":"warn","ts":1651878845.099606,"caller":"memorylimiterprocessor/memorylimiter.go:291","msg":"Memory usage is above hard limit. Forcing a GC.","kind":"processor","name":"memory_limiter","pipeline":"traces","cur_mem_mib":2061}

If the sending_queue queue_size of 5000 does not fit within the memory limits, what is going to happen? Will the oldest trace spans get dropped from the queue?
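
As a mitigation I am considering explicitly lowering the queue size on the jaeger exporter. A sketch of what I mean (the sending_queue values are guesses on my part, not tested yet):

exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 500   # instead of the default of 5000 batches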

However, I did also have one machine where otelcol suddenly used 2 GB of memory while no traces were being queued. There were no warnings at the time, and unfortunately I have no info logs from that date (otelcol was still logging to the Windows application event log).

(screenshot of the memory usage attached)

There are no apps emitting traces on that machine yet, so I have no idea what happened there.

I have collected and attached various debug information from http://localhost:1777/debug/pprof/, in case the memory usage is not okay.

What version did you use?

.\otelcol.exe --version
otelcol version **0.48.0**

What config did you use?

I am using two config files (currently three, to debug) consisting of the following parts:

Config file 1 (common and metrics):

exporters:
  prometheus:
    endpoint: 0.0.0.0:7299
    metric_expiration: 5m  # default = 5m
    send_timestamps: true

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: localmetrics
          scrape_interval: 17s

          tls_config:
            insecure_skip_verify: true

          static_configs:
            - targets: [localhost:8888] # Self diagnostic metrics of otelcol
              labels:
                from_app: otelcol
          file_sd_configs:
            - files:
              - "C:/Program Files/OpenTelemetry/OTEL Collector/metric-targets/*.yaml"

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:

extensions:
  health_check:
    endpoint: localhost:13133
  zpages:
    endpoint: localhost:42424
  memory_ballast:
    size_mib: 256

service:
  extensions: [health_check , zpages]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter , batch]
      exporters: [prometheus]

  # Otel-Collector Self-Diagnostics
  telemetry:
    logs:
      level: info
      encoding: json
      output_paths:       ["stdout"]
      error_output_paths: ["stderr"]
    metrics:
      address: localhost:8888

I just noticed an error in my config... the memory_ballast extension is not actually enabled, it seems.
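
If I understand it correctly, the fix is simply to list memory_ballast in service.extensions, like this:

service:
  extensions: [health_check , zpages , memory_ballast]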

Second file, used optionally (if traces are required):

exporters:
  jaeger:
    endpoint: "${AX_MONITORING_SERVER}:14250"
    tls:
      insecure: true

receivers:
  jaeger:
    protocols:
      thrift_compact:
        endpoint: localhost:6831
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

service:
  pipelines:
    traces:
      receivers: [otlp , jaeger]
      processors: [memory_limiter , batch]
      exporters: [jaeger]

Third config file currently used to add pprof:

extensions:
  pprof:
    endpoint: "127.0.0.1:1777"
    block_profile_fraction: 3
    mutex_profile_fraction: 5
service:
  extensions: [pprof,health_check,zpages]

Environment

OS: Windows 10 21H2, Windows Server 2019