open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.13k stars 2.39k forks source link

[pkg/ottl] truncate_all function corrupts UTF-8 encoding #36017

Open mugli opened 1 month ago

mugli commented 1 month ago

Component(s)

pkg/ottl

What happened?

Description

The truncate_all OTTL function truncates the string (UTF-8 encoded byte slice in Go) without checking if it breaks multi-byte UTF-8 codepoints in the middle. So using the transform processor with truncate_all often results in corrupted UTF-8 values.

Here's the code: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/ed099909a87ac3d51f577b9e86e15369481861a3/pkg/ottl/ottlfuncs/func_truncate_all.go#L50

(It looks like opentelemetry-go SDK had a similar problem which is fixed now: https://github.com/open-telemetry/opentelemetry-go/issues/3160)

Impact: While the collector or a chain of collectors usually works fine with the corrupted UTF-8 down the line (https://github.com/open-telemetry/opentelemetry-collector/issues/11449), some components fail:

Steps to Reproduce

The truncate_all slicing code should be self-explanatory that it doesn't handle rune.

Expected Result

truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last rune started.

Actual Result

truncate_all slices UTF-8 encoded string without checking if it breaks a multi-byte character in the middle. This results in the entire batch getting dropped on the vendor side.

Collector version

v0.111.0

Environment information

Environment

This applies to all environment

OpenTelemetry Collector configuration

extensions:
  health_check: {}
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  transform:
    trace_statements:
      - context: span
        statements:
        - truncate_all(attributes, 4095)
        - truncate_all(resource.attributes, 4095)
exporters:
  otlphttp:
    endpoint: https://otlp.nr-data.net
    headers:
      api-key: $NEW_RELIC_API_KEY
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [otlphttp]
github-actions[bot] commented 1 month ago

Pinging code owners:

mugli commented 3 weeks ago

https://github.com/open-telemetry/opentelemetry-go/pull/3156 seems like a quick fix, but iterating over the entire string is inefficient. Especially for truncate_all function that iterates over every incoming span, we need it to be efficient.

Tailscale's UTF-8 aware truncation seems much better, although licensed under BSD-3. I did not find any SPDX-License-Identifier in this repository that's not Apache-2.0. Not sure how OpenTelemetry authors would prefer to approach this, but happy to offer a PR if someone has any input. 🙏

TylerHelmuth commented 3 weeks ago

We're definitely interested in a fix. If there isn't an efficient solution we could add an optional parameter that enables when to check for this kind of truncation.

mugli commented 1 week ago

If anyone else is facing the same problem, switching to HTTP(s) with JSON payload on the exporter temporarily solves it during exporting (the ingestion side is happy and does not drop the entire batch). The JSON marshaller in Go coerces invalid UTF-8 byte sequences to valid UTF-8 by replacing them with U+FFFD.

Screenshot 2024-11-11 at 13 39 56

This is not a fix, but just a temporary workaround.

yigithankarabulut commented 2 days ago

I want to join this contribfest! Could you assign it to me?