open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

panic: runtime error: slice bounds out of range with v0.82.0 #24908

Closed: stephenhong closed this issue 1 year ago

stephenhong commented 1 year ago

Component(s)

No response

What happened?

Description

The collector shuts down and restarts repeatedly due to the following error:

2023-08-04T13:20:30.557-0400    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 17, "metrics": 166, "data points": 314}
panic: runtime error: slice bounds out of range [-63:] [recovered]
    panic: runtime error: slice bounds out of range [-63:]

(The full stack trace is reproduced in the Log output section below.)

Steps to Reproduce

Run the OpenTelemetry Collector v0.82.0 with the config.yaml below. Multiple apps send traces, metrics, and logs to this collector, but I'm not sure which exact data is causing this.

Expected Result

No panic: runtime error

Actual Result

Got the above error

Collector version

v0.82.0

Environment information

Environment

OS: AmazonLinux2

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:8443"
  prometheus:
    config:
      scrape_configs:
        - job_name: '$NR_ACCOUNT_NAME/otel-self-metrics-gateway-$aws_region'
          scrape_interval: 1m
          static_configs:
            - targets: [ '0.0.0.0:9999' ]

exporters:
  logging:
    verbosity: normal
  splunk_hec:
    # Splunk HTTP Event Collector token.
    token: $SPLUNK_TOKEN
    # URL to a Splunk instance to send data to.
    endpoint: $SPLUNK_ENDPOINT
    # Optional Splunk source: https://docs.splunk.com/Splexicon:Source
    source: "otel"
    # Optional Splunk source type: https://docs.splunk.com/Splexicon:Sourcetype
    sourcetype: "otel"
    # Splunk index, optional name of the Splunk index targeted.
    index: $SPLUNK_INDEX
    # Maximum HTTP connections to use simultaneously when sending data. Defaults to 100.
    max_connections: 200
    # Whether to disable gzip compression over HTTP. Defaults to false.
    disable_compression: false
    # HTTP timeout when sending data. Defaults to 10s.
    timeout: 10s
  otlp:
    endpoint: $OTLP_ENDPOINT
    headers:
      api-key: $NR_API_KEY
    compression: gzip
  datadog:
    api:
      site: datadoghq.com
      key: $DD_API_KEY

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 10
  batch:
    send_batch_size: 4096
    send_batch_max_size: 4096
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          # comment a metric to remove from exclusion rule
          - otelcol_exporter_queue_capacity
          # - otelcol_exporter_queue_size
          - otelcol_exporter_enqueue_failed_spans
          - otelcol_exporter_enqueue_failed_log_records
          - otelcol_exporter_enqueue_failed_metric_points
          # - otelcol_exporter_sent_metric_points
          - otelcol_exporter_send_failed_metric_points
          # - otelcol_exporter_sent_spans
          - otelcol_process_runtime_heap_alloc_bytes
          - otelcol_process_runtime_total_alloc_bytes
          - otelcol_processor_batch_timeout_trigger_send
          # - otelcol_process_memory_rss
          - otelcol_process_runtime_total_sys_memory_bytes
          # - otelcol_process_cpu_seconds
          - otelcol_process_uptime
          # - otelcol_receiver_accepted_metric_points
          # - otelcol_receiver_refused_metric_points
          # - otelcol_receiver_accepted_spans
          # - otelcol_receiver_refused_spans
          - otelcol_scraper_errored_metric_points
          - otelcol_scraper_scraped_metric_points
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added
          - scrape_duration_seconds
          # - up

extensions:
  health_check:
    endpoint: "0.0.0.0:8080"
  pprof:
  zpages:
    endpoint: "0.0.0.0:11400"

service:
  extensions: [pprof, zpages, health_check]
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging, otlp, datadog]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp, prometheus]
      exporters: [logging, otlp, datadog]
      processors: [memory_limiter, batch, filter]
    logs:
      receivers: [otlp]
      exporters: [logging, splunk_hec]
      processors: [memory_limiter, batch]

Log output

2023-08-04T13:20:30.557-0400    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging", "resource metrics": 17, "metrics": 166, "data points": 314}
panic: runtime error: slice bounds out of range [-63:] [recovered]
    panic: runtime error: slice bounds out of range [-63:]
goroutine 215 [running]:
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.func1()
    go.opentelemetry.io/otel/sdk@v1.16.0/trace/span.go:383 +0x2a
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc000b7b080, {0x0, 0x0, 0xc001aae8ca?})
    go.opentelemetry.io/otel/sdk@v1.16.0/trace/span.go:421 +0xa29
panic({0x6e8c620, 0xc00063f9e0})
    runtime/panic.go:884 +0x213
go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*Metric).MarshalToSizedBuffer(0xc000650140, {0xc001aae000, 0x202, 0xc071})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/metrics/v1/metrics.pb.go:2246 +0x45c
go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*ScopeMetrics).MarshalToSizedBuffer(0xc001086380, {0xc001aae000, 0xdbb, 0xc071})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/metrics/v1/metrics.pb.go:2198 +0x23c
go.opentelemetry.io/collector/pdata/internal/data/protogen/metrics/v1.(*ResourceMetrics).MarshalToSizedBuffer(0xc000a96060, {0xc001aae000, 0xde3, 0xc071})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/metrics/v1/metrics.pb.go:2144 +0x25c
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1.(*ExportMetricsServiceRequest).MarshalToSizedBuffer(0xc001c2fd40, {0xc001aae000, 0xc071, 0xc071})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/collector/metrics/v1/metrics_service.pb.go:352 +0xac
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1.(*ExportMetricsServiceRequest).Marshal(0xc0000ec400?)
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/collector/metrics/v1/metrics_service.pb.go:332 +0x56
google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x826d708, 0xc0006e2140}, {0x0, 0x0, 0x0}, 0x0})
    google.golang.org/protobuf@v1.31.0/internal/impl/legacy_message.go:402 +0xa2
google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0xc0?, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x826d708, 0xc0006e2140})
    google.golang.org/protobuf@v1.31.0/proto/encode.go:166 +0x27b
google.golang.org/protobuf/proto.MarshalOptions.MarshalAppend({{}, 0x40?, 0x82?, 0xb?}, {0x0, 0x0, 0x0}, {0x81de340?, 0xc0006e2140?})
    google.golang.org/protobuf@v1.31.0/proto/encode.go:125 +0x79
github.com/golang/protobuf/proto.marshalAppend({0x0, 0x0, 0x0}, {0x7f3cdfde93e8?, 0xc001c2fd40?}, 0x70?)
    github.com/golang/protobuf@v1.5.3/proto/wire.go:40 +0xa5
github.com/golang/protobuf/proto.Marshal(...)
    github.com/golang/protobuf@v1.5.3/proto/wire.go:23
google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x70b8240, 0xc001c2fd40})
    google.golang.org/grpc@v1.57.0/encoding/proto/proto.go:45 +0x4e
google.golang.org/grpc.encode({0x7f3cdfde9378?, 0xc74f210?}, {0x70b8240?, 0xc001c2fd40?})
    google.golang.org/grpc@v1.57.0/rpc_util.go:633 +0x44
google.golang.org/grpc.prepareMsg({0x70b8240?, 0xc001c2fd40?}, {0x7f3cdfde9378?, 0xc74f210?}, {0x0, 0x0}, {0x82248b0, 0xc0001bc0a0})
    google.golang.org/grpc@v1.57.0/stream.go:1766 +0xd2
google.golang.org/grpc.(*clientStream).SendMsg(0xc0006f6480, {0x70b8240?, 0xc001c2fd40})
    google.golang.org/grpc@v1.57.0/stream.go:882 +0xfd
google.golang.org/grpc.invoke({0x8234bd8?, 0xc0009ec2d0?}, {0x7636bed?, 0x4?}, {0x70b8240, 0xc001c2fd40}, {0x70b8380, 0xc000010540}, 0x0?, {0xc000886060, ...})
    google.golang.org/grpc@v1.57.0/call.go:75 +0xa8
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryClientInterceptor.func1({0x8234bd8, 0xc0009ec210}, {0x7636bed, 0x3f}, {0x70b8240, 0xc001c2fd40}, {0x70b8380, 0xc000010540}, 0xc001d6a000, 0x77f86f8, ...)
    go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.42.1-0.20230612162650-64be7e574a17/interceptor.go:100 +0x3e4
google.golang.org/grpc.(*ClientConn).Invoke(0xc001d6a000, {0x8234bd8, 0xc0009ec210}, {0x7636bed, 0x3f}, {0x70b8240, 0xc001c2fd40}, {0x70b8380, 0xc000010540}, {0xc000abb390, ...})
    google.golang.org/grpc@v1.57.0/call.go:40 +0x24d
go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/metrics/v1.(*metricsServiceClient).Export(0xc000915098, {0x8234bd8, 0xc0009ec210}, 0xc0010c5770?, {0xc000abb390, 0x1, 0x1})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/internal/data/protogen/collector/metrics/v1/metrics_service.pb.go:272 +0xc9
go.opentelemetry.io/collector/pdata/pmetric/pmetricotlp.(*grpcClient).Export(0x8234bd8?, {0x8234bd8?, 0xc0009ec210?}, {0xc0009ec1e0?}, {0xc000abb390?, 0xc000fca180?, 0x2?})
    go.opentelemetry.io/collector/pdata@v1.0.0-rcv0014/pmetric/pmetricotlp/grpc.go:41 +0x30
go.opentelemetry.io/collector/exporter/otlpexporter.(*baseExporter).pushMetrics(0xc000970580, {0x8234ba0?, 0xc0009ec1e0?}, {0x8234bd8?})
    go.opentelemetry.io/collector/exporter/otlpexporter@v0.82.0/otlp.go:107 +0x87
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export(0x8234bd8?, {0x8234ba0?, 0xc0009ec1e0?})
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/metrics.go:54 +0x34
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send(0xc000af4708, {0x8257508, 0xc000a054a0})
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/common.go:197 +0x96
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send(0xc000b36e60, {0x8257508, 0xc000a054a0})
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/queued_retry.go:384 +0x596
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send(0xc00094ec00, {0x8257508, 0xc000a054a0})
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/metrics.go:125 +0x88
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1({0x8257508, 0xc000a054a0})
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/queued_retry.go:195 +0x39
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1()
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/internal/bounded_memory_queue.go:47 +0xb6
created by go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers
    go.opentelemetry.io/collector/exporter@v0.82.0/exporterhelper/internal/bounded_memory_queue.go:42 +0x45

Additional context

No response

dmitryax commented 1 year ago

Looks like another instance of https://github.com/open-telemetry/opentelemetry-collector/issues/6794.

@stephenhong did you have this issue before 0.82.0?

@mx-psi, @songy23, @mackjmr is it possible that the Datadog exporter recently started performing a mutating operation on the original metrics pdata?
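
For illustration, here is a minimal, self-contained sketch of that failure mode. Assumptions: this is not the collector's fanout or Datadog exporter code; the metric name, loop counts, and the specific mutation are made up. Two goroutines share one pmetric.Metrics: one marshals it to protobuf, as the otlp exporter does, while the other mutates it in place. Running it with go run -race flags the race; inside the collector, the same pattern can surface as the slice-bounds panic above, because the marshaler's pre-computed sizes no longer match the mutated data.

package main

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// Build a tiny metrics payload that stands in for the shared pdata the
	// fanout consumer hands to every exporter in the metrics pipeline.
	md := pmetric.NewMetrics()
	m := md.ResourceMetrics().AppendEmpty().
		ScopeMetrics().AppendEmpty().
		Metrics().AppendEmpty()
	m.SetName("example.metric")
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(1)

	marshaler := &pmetric.ProtoMarshaler{}

	var wg sync.WaitGroup
	wg.Add(2)

	// Reader: repeatedly marshals the shared payload, like the otlp exporter.
	go func() {
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			_, _ = marshaler.MarshalMetrics(md)
		}
	}()

	// Writer: mutates the same payload in place without the pipeline knowing
	// (a stand-in for an exporter that does not declare MutatesData: true).
	go func() {
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			m.SetName("a.much.longer.metric.name") // changes the encoded size
			m.SetName("x")
		}
	}()

	wg.Wait()
}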

mx-psi commented 1 year ago

This could be related to the Datadog exporter. My current guess is DataDog/opentelemetry-mapping-go/pull/101, which was enabled in the Collector in #23445. I'll confirm with @gbbr and open a PR to set MutatesData to true.
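
For anyone following along: an exporter tells the collector that it modifies the data it receives via consumer.Capabilities{MutatesData: true}, which makes the fanout consumer clone the pdata before handing it to that exporter, so sibling exporters such as otlp keep marshaling unmutated data. Below is a hedged sketch of what that looks like for an exporterhelper-based metrics exporter; the package, function name, and push function are hypothetical and not the actual Datadog exporter code.

package ddexample

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/exporter"
	"go.opentelemetry.io/collector/exporter/exporterhelper"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// createMetricsExporter is a hypothetical factory function showing where the
// MutatesData capability is declared for an exporterhelper-based exporter.
func createMetricsExporter(
	ctx context.Context,
	set exporter.CreateSettings,
	cfg component.Config,
) (exporter.Metrics, error) {
	push := func(ctx context.Context, md pmetric.Metrics) error {
		// Exporter-specific conversion and sending would happen here; if it
		// rewrites md in place, the capability below is required.
		return nil
	}

	return exporterhelper.NewMetricsExporter(
		ctx, set, cfg, push,
		// Declare that this exporter mutates the pdata it receives, so the
		// fanout consumer gives it its own copy.
		exporterhelper.WithCapabilities(consumer.Capabilities{MutatesData: true}),
	)
}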

stephenhong commented 1 year ago

@dmitryax I saw this issue in v0.80.0 as well during local testing. At the time I was using an old version of the OTel Java agent and the collector was not using the Datadog exporter. The error didn't show up again after switching the collector to v0.82.0, so I thought it was fixed. But when I enabled the Datadog exporter, the error came back.

mx-psi commented 1 year ago

Thanks for the report @stephenhong. It is expected that you would also see this in v0.80.0 if the underlying cause is https://github.com/DataDog/opentelemetry-mapping-go/pull/101. This will be fixed in v0.83.0 by the PR that closed this issue.