open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

OTEL collector crashes when using googlecloudpubsub receiver with encoding set to cloud_logging #32007

Open ZachTB123 opened 7 months ago

ZachTB123 commented 7 months ago

Component(s)

receiver/googlecloudpubsub

What happened?

Description

I'm trying to use the googlecloudpubsub receiver to receive Cloud Logs. I have configured a log router to route all my logs to a pub/sub topic. The inclusion filter on the sink is resource.type = ("cloud_run_revision") OR log_id("dialogflow-runtime.googleapis.com/requests"). I have no exclusion filter. After some time, the collector crashes with the log output below.

Setting encoding to raw_text works without issue.

Steps to Reproduce

  1. Create a log router described above.
  2. Run the collector with the configuration below.

Expected Result

The collector does not crash.

Actual Result

The collector crashes.

Collector version

v0.97.0

Environment information

No response

OpenTelemetry Collector configuration

receivers:
  googlecloudpubsub:
    project: my-project
    subscription: my-subscription
    encoding: cloud_logging

processors: {}

exporters:
  logging/debug:
    loglevel: debug
  logging/error:
    loglevel: error

service:
  telemetry:
    logs:
      level: DEBUG
  pipelines:
    logs:
      receivers: [googlecloudpubsub]
      processors: []
      exporters: [logging/debug]

Log output

panic: runtime error: index out of range [8] with length 8

goroutine 59 [running]:
encoding/hex.Decode({0xc002c69968?, 0x0?, 0xc001cb54d0?}, {0xc001ca2a80?, 0xc001e28c80?, 0xc001cb54d0?})
    encoding/hex/hex.go:101 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver/internal.spanIDStrToSpanIDBytes({0xc001ca2a80?, 0xc001c93230?})
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/internal/log_entry.go:59 +0x4b
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver/internal.TranslateLogEntry({0x58b?, 0x58b?}, 0xc002c69c60?, {0xc002d45680, 0x45b, 0x480})
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/internal/log_entry.go:231 +0x43d
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver.(*pubsubReceiver).handleCloudLoggingLogEntry(0xc0028056b0, {0x948ef20, 0xef7b1c0}, 0xa43440?)
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/receiver.go:145 +0x56
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver.(*pubsubReceiver).createReceiverHandler.func1({0x948ef20, 0xef7b1c0}, 0xc002d68050)
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/receiver.go:299 +0x186
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver/internal.(*StreamHandler).responseStream(0xc0028a86e0, {0x94904b8, 0xc001c8e6e0}, 0xc001c8c530)
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/internal/handler.go:193 +0x65d
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver/internal.(*StreamHandler).recoverableStream in goroutine 46
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/googlecloudpubsubreceiver@v0.97.0/internal/handler.go:109 +0x1cb

Additional context

No response

github-actions[bot] commented 7 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

crobert-1 commented 7 months ago

It looks like this panic is happening because a spanId is longer than expected. From the spec, spanId must be an 8-byte array.

Can you provide a sample log that's causing this panic to happen so we can confirm this to be the case?

Incoming data in the wrong format shouldn't cause the collector to panic. The receiver should log an error and drop the bad data instead.

ZachTB123 commented 7 months ago

I believe this is coming from log entries where logName is equal to projects/project-id/logs/run.googleapis.com%2Frequests. Based on some previous logs that I've ingested by setting encoding to raw_text, the value for spanId is 20 characters long. For example:

{
    "spanId": "15426074336963245120"
}
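That value is a 20-digit decimal string, and hex-decoding it into an 8-byte buffer reproduces the exact panic in the stack trace above. A minimal sketch (`decodeSpanID` here is an illustrative stand-in, not the receiver's actual helper):

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// decodeSpanID mimics what the receiver does: hex-decode the span ID string
// straight into a fixed 8-byte array. It returns the recovered panic value
// (nil if decoding succeeded) so the failure mode is easy to observe.
func decodeSpanID(s string) (recovered interface{}) {
	defer func() { recovered = recover() }()
	var dst [8]byte
	hex.Decode(dst[:], []byte(s))
	return nil
}

func main() {
	// 16 hex characters decode cleanly into 8 bytes: prints <nil>.
	fmt.Println(decodeSpanID("0123456789abcdef"))
	// The 20-digit decimal span ID from the log sample: every digit is a
	// valid hex character, DecodedLen(20) = 10 > 8, so hex.Decode writes
	// past the 8-byte array and panics with "index out of range [8]".
	fmt.Println(decodeSpanID("15426074336963245120"))
}
```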

alexvanboxel commented 7 months ago

I will have a look; this ticket can be assigned to me.

alexvanboxel commented 6 months ago

This issue is reproducible, but I've logged an issue with Google Cloud as it's a bug on their side: https://issuetracker.google.com/issues/338634230

I will make the parsing safer so the collector doesn't crash, but I will not try to detect decimal span IDs; I will treat them as too-long hex strings.
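One possible shape for that safer parsing (a sketch only; `safeSpanIDToBytes` is a hypothetical name, not the actual fix in the receiver): validate the length up front so a malformed span ID becomes an error to log and drop, instead of letting `hex.Decode` write past the 8-byte buffer.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// safeSpanIDToBytes is a hypothetical safer variant of the receiver's
// spanIDStrToSpanIDBytes: it checks the string length before decoding, so a
// 20-digit decimal span ID is reported as an error instead of panicking.
func safeSpanIDToBytes(s string) ([8]byte, error) {
	var id [8]byte
	if len(s) != 2*len(id) {
		return id, fmt.Errorf("span ID %q: want 16 hex characters, got %d", s, len(s))
	}
	if _, err := hex.Decode(id[:], []byte(s)); err != nil {
		return id, fmt.Errorf("span ID %q: %w", s, err)
	}
	return id, nil
}

func main() {
	// A well-formed 16-hex-char span ID decodes as before.
	if id, err := safeSpanIDToBytes("0123456789abcdef"); err == nil {
		fmt.Printf("decoded: %x\n", id)
	}
	// The decimal span ID from the bug report is rejected, not a crash.
	if _, err := safeSpanIDToBytes("15426074336963245120"); err != nil {
		fmt.Println("dropped:", err)
	}
}
```

The receiver can then log the error and skip the span ID (or the whole entry) rather than taking down the collector.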

github-actions[bot] commented 4 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

geekflyer commented 4 months ago

Can we do a fix/workaround on the collector side for this? I bet GCP is going to take a while to change this in Cloud Run.

tjun commented 2 months ago

@alexvanboxel Hi, thank you for your PR! Would it be possible to reopen https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/33247 and have it merged? We have been trying to use the googlecloudpubsub receiver with cloud_logging encoding and have frequently hit this crash, which has been troubling us. When we tested with the code from your PR, the problem no longer occurred. We would be very happy if your PR could be merged and made available for use.

github-actions[bot] commented 3 weeks ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.