open-telemetry / opentelemetry-rust

The Rust OpenTelemetry implementation
https://opentelemetry.io
Apache License 2.0

OOM Killed when collector client can not connect #956

Closed: thedodd closed this issue 1 year ago

thedodd commented 1 year ago

Over time, perhaps due to collector maintenance or other circumstances, the OTEL tracing machinery appears to get into a VERY hot loop: it spams the following error message to stdout/stderr (OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full) and eventually causes the app to be OOM killed due to excess memory usage.

Deps

opentelemetry = { version = "0.18", default-features = false, features = ["rt-tokio", "trace"] }
opentelemetry-jaeger = { version = "0.17", default-features = false, features = ["rt-tokio", "collector_client", "reqwest_collector_client"] }
tracing = { version = "0.1", default-features = false, features = ["std", "attributes", "release_max_level_debug"] }
tracing-opentelemetry = { version = "0.18", default-features = false, features = ["tracing-log"] }
tracing-subscriber = { version = "0.3", default-features = false, features = ["env-filter", "fmt", "json", "smallvec", "std", "tracing-log"] }

Initialization

// Imports assumed for this snippet (paths as of opentelemetry 0.18 / opentelemetry-jaeger 0.17).
use std::time::Duration;
use opentelemetry::global::set_text_map_propagator;
use opentelemetry::sdk::trace::BatchConfig;
use opentelemetry_jaeger::Propagator as JaegerPropagator;

let tracer = opentelemetry_jaeger::new_collector_pipeline()
    .with_service_name("...")
    .with_reqwest()
    .with_timeout(Duration::from_secs(5))
    // Cap each export attempt at 5s so a slow collector cannot stall the batch processor.
    .with_batch_processor_config(BatchConfig::default().with_max_export_timeout(Duration::from_secs(5)))
    .install_batch(opentelemetry::runtime::Tokio)?;
let telem = tracing_opentelemetry::layer().with_tracer(tracer);
set_text_map_propagator(JaegerPropagator::new());
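
For context, a layer built this way is normally attached to a tracing-subscriber registry before spans are produced. A minimal sketch of that wiring, assuming an env-filter plus a plain fmt layer (this part is not taken verbatim from the app):

use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;
use tracing_subscriber::EnvFilter;

// Hypothetical wiring: combine the OpenTelemetry layer with an env-filter and
// a fmt layer, then install the result as the global subscriber.
tracing_subscriber::registry()
    .with(EnvFilter::from_default_env())
    .with(tracing_subscriber::fmt::layer())
    .with(telem)
    .init();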

Env

OTEL_EXPORTER_JAEGER_ENDPOINT="http://tracing-collector-endpoint-which-works-just-fine"
OTEL_EXPORTER_JAEGER_PROTOCOL="udp/thrift.binary"
OTEL_SERVICE_NAME="..."
OTEL_TRACES_SAMPLER="always_on"

All usage of tracing throughout the app is via function instrumentation (#[tracing::instrument(...)]) and various calls to tracing::debug!/info!/warn!/error!, etc.
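
As a minimal illustration of that pattern (the function name, argument, and fields below are made up for the example):

// Hypothetical handler: a span is created on entry, events are recorded inside
// it, and the finished span is handed to the batch span processor when the
// function returns.
#[tracing::instrument(skip(payload), fields(len = payload.len()))]
async fn handle_request(payload: Vec<u8>) {
    tracing::debug!("processing request");
    if payload.is_empty() {
        tracing::warn!("empty payload received");
    }
}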

TommyCpp commented 1 year ago

Thanks for the bug report. The error message you saw indicates that spans are being dropped when they end because of exporter backpressure. We use a bounded channel to send spans to the span processor, so the OOM is unlikely to come from there.
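
To illustrate the mechanism described above, here is a rough analogue of that bounded-queue behaviour using a Tokio channel (this is not the actual processor code, which may use a different channel implementation): when the queue is full, the span is dropped and an error is reported, rather than being buffered without limit.

use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

// Bounded queue standing in for the batch span processor's input channel.
let (tx, _rx) = mpsc::channel::<String>(2_048);
match tx.try_send("finished span".to_string()) {
    Ok(()) => {}
    Err(TrySendError::Full(_span)) => {
        // The span is dropped here; memory usage stays bounded.
        eprintln!("cannot send span to the batch span processor because the channel is full");
    }
    Err(TrySendError::Closed(_span)) => {
        eprintln!("span processor has shut down");
    }
}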

I am thinking it may be something inside the Jaeger exporter. I can dig deeper and see whether the OOM can be caused by the Jaeger exporter.

TommyCpp commented 1 year ago

The first step I had in mind is to narrow the memory leak down to a component. Over the weekend I tested some examples using only opentelemetry, based on the description above, under Valgrind, and it didn't report any memory leak, so I feel like I missed something in my setup. The next step is probably to include tracing-opentelemetry and see if that causes a memory leak.
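
For reference, a minimal opentelemetry-only repro of this scenario could look roughly like the sketch below. The endpoint, service name, and loop count are placeholders, and it assumes a Tokio runtime plus the opentelemetry 0.18 / opentelemetry-jaeger 0.17 versions from the report:

use opentelemetry::trace::{TraceError, Tracer};

#[tokio::main]
async fn main() -> Result<(), TraceError> {
    // Point the collector client at an endpoint that refuses connections, to
    // simulate the "collector can not connect" scenario.
    std::env::set_var("OTEL_EXPORTER_JAEGER_ENDPOINT", "http://127.0.0.1:1/api/traces");

    let tracer = opentelemetry_jaeger::new_collector_pipeline()
        .with_service_name("leak-check")
        .with_reqwest()
        .install_batch(opentelemetry::runtime::Tokio)?;

    // Generate spans in a loop while watching memory under Valgrind.
    for i in 0..1_000_000u64 {
        tracer.in_span(format!("span-{i}"), |_cx| {});
    }

    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}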

Curious, when you say "collector client can not connect", do you mean that the backend latency increased dramatically, or that the client simply encounters something like connection refused?

thedodd commented 1 year ago

Basically just connection refused. In these cases, it is typically due to the collector pods in Kubernetes being rescheduled or the like.

thedodd commented 1 year ago

As it turns out, the issue appears to be closely related to what is described here: https://fly.io/blog/rust-memory-leak/

Thanks for taking a look; it looks like it was a mistake on my end.