open-telemetry / opentelemetry-rust

The Rust OpenTelemetry implementation
https://opentelemetry.io
Apache License 2.0

Exporter jaeger encountered the following error(s): thrift agent failed with message too long #851

Closed taladar closed 5 months ago

taladar commented 2 years ago

I am using opentelemetry-jaeger 0.16.0 with tracing-opentelemetry 0.17.4 on Linux along with the latest Jaeger binary (non-docker) release.

I did see the old tickets on this issue (#648, #676, and #759), but none of the workarounds described there seem to have any significant impact on the issue.

I tried setting OTEL_BSP_MAX_EXPORT_BATCH_SIZE to progressively lower values down to 1 which did not make the error disappear. I tried using with_auto_split_batch(true) with with_max_packet_size with values of 8192, 4096, 1024, 512 and finally 256 and the error did not disappear. I tried install_simple() instead of install_batch(Tokio) and the error did not disappear.

I do not produce spans that are particularly large (just half a line of text and a host name as a parameter) but they are relatively short and plentiful.

Some data appears in jaeger but it is just a dozen or two spans before it aborts (per run of my program obviously).

I honestly have my doubts about the UDP packet size explanation in the previous tickets since everyone else seems to be able to handle sending larger amounts of data over UDP just fine.

Maybe it would be useful to add more information to the error message, e.g. how long it was, how many and which spans were batched and what the limit for 'too long' is.

bes commented 2 years ago

I too am experiencing this issue and the proposed fixes (as seen in different issues) don't work for me.

OTEL_BSP_MAX_EXPORT_BATCH_SIZE=25 OTEL_BSP_MAX_QUEUE_SIZE=32768 cargo run

// Imports assumed for the opentelemetry 0.17-era crates used in this thread.
use opentelemetry::sdk::trace as sdktrace;
use opentelemetry::trace::TraceError;
use std::error::Error;
use tracing_subscriber::{layer::SubscriberExt, Registry};

fn init_tracer() -> Result<sdktrace::Tracer, TraceError> {
    opentelemetry_jaeger::new_pipeline()
        .with_service_name("trace-demo")
        .with_max_packet_size(9216) // Default max UDP packet size on OSX
        .with_auto_split_batch(true) // Auto split batches so they fit under packet size
        .install_batch(opentelemetry::runtime::Tokio)
}

#[tokio::main()]
async fn main() -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    // JAEGER
    // Create a layer with the configured tracer
    let tracer = init_tracer()?;
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = Registry::default().with(otel_layer);
    tracing::subscriber::set_global_default(subscriber).expect("setting default subscriber failed");
    // ... application work that emits spans goes here ...
    Ok(())
}

But still seeing

OpenTelemetry trace error occurred. Exporter jaeger encountered the following error(s): thrift agent failed with message too long

Before the changes above I saw

OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full

I am running a few hundred async tasks in parallel, and a jaeger instance locally.

fmassot commented 1 year ago

I observed the same issue on our OSS project https://github.com/quickwit-oss/quickwit/issues/2295

I tried a bunch of different settings but did not manage to make it work. The errors happen when there are a lot of spans. I will try to isolate that and report it here.

fmassot commented 1 year ago

I dug into the issue a bit, and the problem comes from the size in bytes of the spans that are sent to Jaeger.

I guess the error that you have is the same as mine (I added some print statements to the code to get this):

upload error ExportFailed(ThriftAgentError(ProtocolError { kind: SizeLimit, message: "single span's jaeger exporter payload size of 28330 bytes over max UDP packet size of 10000 bytes" }))

I solved the issue by doing two things:

@bes In your case, what is the max size of your UDP packet? I'm on macOS and I had to run sudo sysctl -w net.inet.udp.maxdgram=65535 to get a decent size.
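
One way to answer the packet-size question empirically is to ask the OS directly. The following is a minimal sketch using only the Rust standard library; the function name `try_send_datagram` and the list of probe sizes are illustrative choices, not part of any OpenTelemetry or Jaeger API. It sends datagrams of increasing size over loopback and reports which ones the OS rejects.

```rust
use std::io;
use std::net::UdpSocket;

/// Tries to send one UDP datagram of `len` bytes over loopback.
/// Returns Ok(()) if the OS accepted the datagram, or the OS error
/// (for example "message too long") if it refused to send it.
fn try_send_datagram(len: usize) -> io::Result<()> {
    // Bind a throwaway receiver so the datagram has a real destination.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    let payload = vec![0u8; len];
    sender.send_to(&payload, receiver.local_addr()?)?;
    Ok(())
}

fn main() {
    // Probe sizes taken from the numbers mentioned in this thread,
    // plus one value above the theoretical UDP-over-IPv4 payload limit.
    for len in [512, 9_216, 10_000, 28_330, 65_507, 70_000] {
        match try_send_datagram(len) {
            Ok(()) => println!("{:>6} bytes: accepted", len),
            Err(e) => println!("{:>6} bytes: rejected ({})", len, e),
        }
    }
}
```

The cut-off depends on the operating system and its settings, so the output will differ between machines. On a macOS host with the default net.inet.udp.maxdgram value, sizes above 9216 bytes would be expected to fail.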

taladar commented 1 year ago

That whole protocol design seems broken if you need to tweak the max UDP packet size to work around the flaws in its design.

TommyCpp commented 1 year ago

As an alternative, have you tried auto_split_batch? This option automatically splits a span batch if it exceeds the maximum UDP size for one packet.

Note that it has a performance overhead.

taladar commented 1 year ago

As mentioned in my first post, I did try that and it was just as broken.

TommyCpp commented 1 year ago

> As mentioned in my first post, I did try that and it was just as broken.

Ah, sorry, I missed it. In this case, the most likely cause is that one of the spans exceeds the UDP packet limit. Since we cannot split a single span, we have to fail the request.

As for the debugging information: we use the Apache Thrift Rust client, so the only information we have is the error passed to us from the thrift agent. I will see what's available and add some more context in the error message.

As for the protocol design, the UDP limit for Jaeger is a known issue (see https://www.jaegertracing.io/docs/1.39/client-libraries/#emsgsize-and-udp-buffer-limits), so I don't think we can do more about that. One suggestion is to switch to http client w/ collector
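
To illustrate why lowering the batch size or enabling batch splitting cannot help once a single span is larger than the packet limit, here is a minimal sketch of greedy batch splitting. It is not the exporter's actual implementation: the names `split_into_packets` and `SpanTooLarge`, and the fixed per-packet `overhead`, are assumptions made for this example. It also shows the kind of detail (span index, size, limit) that an error message could carry.

```rust
use std::fmt;

/// Error returned when one span alone cannot fit in a packet.
#[derive(Debug, PartialEq)]
struct SpanTooLarge {
    index: usize,
    size: usize,
    max_packet_size: usize,
}

impl fmt::Display for SpanTooLarge {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "span #{} needs {} bytes but the max packet size is {} bytes",
            self.index, self.size, self.max_packet_size
        )
    }
}

/// Greedily groups spans (given by their encoded sizes) into packets that
/// stay within `max_packet_size`. `overhead` is a hypothetical fixed cost
/// per packet. Each returned packet lists the indices of the spans it holds.
fn split_into_packets(
    span_sizes: &[usize],
    max_packet_size: usize,
    overhead: usize,
) -> Result<Vec<Vec<usize>>, SpanTooLarge> {
    let mut packets = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut used = overhead;

    for (index, &size) in span_sizes.iter().enumerate() {
        // A span that does not fit even in an empty packet can never be sent.
        if overhead + size > max_packet_size {
            return Err(SpanTooLarge { index, size, max_packet_size });
        }
        // Start a new packet when the current one would overflow.
        if used + size > max_packet_size {
            packets.push(std::mem::take(&mut current));
            used = overhead;
        }
        current.push(index);
        used += size;
    }
    if !current.is_empty() {
        packets.push(current);
    }
    Ok(packets)
}

fn main() {
    // Three small spans: two fit in the first packet, the third spills over.
    println!("{:?}", split_into_packets(&[400, 400, 400], 1000, 100));

    // The sizes reported earlier in this thread: a 28330-byte span
    // against a 10000-byte limit fails no matter how the batch is split.
    match split_into_packets(&[300, 28330], 10_000, 100) {
        Ok(packets) => println!("{:?}", packets),
        Err(e) => println!("error: {}", e),
    }
}
```

With a batch size of 1 the second call still fails, because the oversized span is rejected before any grouping happens. Only a larger packet limit, smaller spans, or a different transport changes the outcome.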

taladar commented 1 year ago

> add some more context in the error message.

It would probably be useful if we could get the size of the message that failed. The parameters I tried are all very opaque in their effect. If you can't print the size, maybe you could print the content you pass to thrift. I don't think performance matters very much at the point where nothing is really working.

> One suggestion is to switch to http client w/ collector

Could you elaborate on that?

punkeel commented 1 year ago

macOS has a max UDP size set to 9216 by default. Would it be acceptable to adjust the values in this library to do the right thing out of the box? Requiring every project to figure this out through trial and error, then fix it, sounds like a waste of time.

$ uname -a
Darwin host 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul  5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000 arm64 arm Darwin
$ sysctl net.inet.udp.maxdgram
net.inet.udp.maxdgram: 9216

cijothomas commented 1 year ago

> > add some more context in the error message.
>
> It would probably be useful if we could get the size of the message that failed. The parameters I tried are all very opaque in their effect. If you can't print the size, maybe you could print the content you pass to thrift. I don't think performance matters very much at the point where nothing is really working.
>
> > One suggestion is to switch to http client w/ collector
>
> Could you elaborate on that?

The dedicated Jaeger exporter is going to be deprecated, so it's unlikely to get bug fixes. The recommendation is to use the OTLP Exporter, as Jaeger can now natively understand OTLP: https://github.com/open-telemetry/opentelemetry-rust/pull/1022 - This PR has an example. (During a recent refactoring, it looks like the example got lost; I will work to bring it back.)

hdost commented 6 months ago

As mentioned, we're looking to stop producing the crate. If you have feedback, please leave it on #995.

cijothomas commented 5 months ago

Closing Jaeger-related issues, given the exporter's imminent removal. See https://github.com/open-telemetry/opentelemetry-rust/issues/995