open-telemetry / opentelemetry-collector-releases

OpenTelemetry Collector Official Releases
https://opentelemetry.io
Apache License 2.0

High CPU utilization on Windows server after configuring the otelcol #603

Open desududev opened 3 months ago

desududev commented 3 months ago

High CPU utilization on Windows server after configuring the otelcol

dosubot[bot] commented 3 months ago

Hey there, @desududev! I'm a bot designed to assist you with bugs, questions, and contributions. Feel free to ask for help while waiting for a human maintainer to join the conversation. 🤖

To troubleshoot and resolve high CPU utilization on a Windows server after configuring the OpenTelemetry Collector (otelcol), you can follow these steps:

  1. Enable Internal Telemetry: Configure the Collector’s internal telemetry to monitor its performance. This will help you understand the resource usage and identify any bottlenecks.
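
    For example, a minimal sketch of the telemetry section (exact field names vary a bit between Collector versions; newer releases configure metric readers instead of an address):

    service:
      telemetry:
        metrics:
          level: detailed
          address: 0.0.0.0:8888
        logs:
          level: info

    The internal metrics are exposed in Prometheus format (on port 8888 by default) and include the Collector's own process CPU and memory usage, which makes it easier to see which change or workload is driving the spike.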

  2. Use the pprof Extension: Enable the pprof extension to profile the Collector as it runs. This can help you identify which parts of the Collector are consuming the most CPU.

    extensions:
      pprof:
        endpoint: 0.0.0.0:1777
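    The extension only runs if it is also listed in the service section; a minimal sketch:

    service:
      extensions: [pprof]

    With that in place, a CPU profile can be fetched from http://localhost:1777/debug/pprof/profile and analyzed with go tool pprof to see where the time is being spent.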
  3. Check for Configuration Issues: Ensure that the Collector is properly configured. Misconfigurations can lead to high CPU usage. For example, make sure that the memory_limiter processor is configured correctly to avoid memory pressure.

    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 4000
        spike_limit_mib: 500
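    The processor only takes effect when it is added to a pipeline, and it should be first in the processors list so it can throttle intake before any other work is done. A minimal sketch (the otlp receiver and batch processor below are placeholders for whatever your pipeline actually uses):

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [debug]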
  4. Inspect Logs and zPages: Check the Collector’s logs and use the zPages extension to inspect live data from the Collector’s receivers and exporters. This can help you identify any errors or performance issues.

    extensions:
      zpages:
        endpoint: 0.0.0.0:55679
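    Like pprof, zpages must also be listed under service.extensions to be active:

    service:
      extensions: [pprof, zpages]

    The live pages are then served on the configured port, for example http://localhost:55679/debug/servicez and http://localhost:55679/debug/tracez.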
  5. Use the debug Exporter: Configure the debug exporter to confirm that the Collector is receiving, processing, and exporting data correctly. This can help you identify any issues in the data flow that might be causing high CPU usage.

    receivers:
      zipkin:
    exporters:
      debug:
    service:
      pipelines:
        traces:
          receivers: [zipkin]
          processors: []
          exporters: [debug]

    To begin testing, generate a Zipkin payload. For example, you can create a file called trace.json that contains:

    [
      {
        "traceId": "5982fe77008310cc80f1da5e10147519",
        "parentId": "90394f6bcffb5d13",
        "id": "67fae42571535f60",
        "kind": "SERVER",
        "name": "/m/n/2.6.1",
        "timestamp": 1516781775726000,
        "duration": 26000,
        "localEndpoint": {
          "serviceName": "api"
        },
        "remoteEndpoint": {
          "serviceName": "apip"
        },
        "tags": {
          "data.http_response_code": "201"
        }
      }
    ]

    With the Collector running, send this payload to the Collector:

    curl -X POST localhost:9411/api/v2/spans -H 'Content-Type: application/json' -d @trace.json
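
    On Windows, note that cmd.exe does not treat the single-quoted header above as quoting (use double quotes instead), and in Windows PowerShell curl is an alias for Invoke-WebRequest, so call curl.exe explicitly to run the command as written.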

    You should see a log entry like the following:

    2023-09-07T09:57:43.468-0700    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2}

    You can also configure the debug exporter so the entire payload is printed:

    exporters:
      debug:
        verbosity: detailed

    If you re-run the previous test with the modified configuration, the debug exporter prints the full contents of every span; the output below shows what this looks like for a span generated with telemetrygen:

    2023-09-07T09:57:12.820-0700    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 2}
    2023-09-07T09:57:12.821-0700    info    ResourceSpans #0
    Resource SchemaURL: https://opentelemetry.io/schemas/1.4.0
    Resource attributes:
         -> service.name: Str(telemetrygen)
    ScopeSpans #0
    ScopeSpans SchemaURL:
    InstrumentationScope telemetrygen
    Span #0
        Trace ID       : 0c636f29e29816ea76e6a5b8cd6601cf
        Parent ID      : 1a08eba9395c5243
        ID             : 10cebe4b63d47cae
        Name           : okey-dokey
        Kind           : Internal
        Start time     : 2023-09-07 16:57:12.045933 +0000 UTC
        End time       : 2023-09-07 16:57:12.046058 +0000 UTC
        Status code    : Unset
        Status message :
    Attributes:
         -> span.kind: Str(server)
         -> net.peer.ip: Str(1.2.3.4)
         -> peer.service: Str(telemetrygen)

By following these steps, you should be able to identify and resolve the cause of high CPU utilization on your Windows server after configuring the OpenTelemetry Collector [1][2].

To continue talking to Dosu, mention @dosu.

jpkrohling commented 3 months ago

Good bot, especially the first three steps.