quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.8k stars 2.68k forks source link

OOM in Quarkus 3.0.0.Beta1 caused by okio via OpenTelemetry #32238

Closed snazy closed 1 year ago

snazy commented 1 year ago

Describe the bug

okio starts a thread via the class okio.AsyncTimeout.Watchdog once "it has to deal with a timeout". The thread's configured as a daemon thread and seems to have some shutdown logic, and also aborts when its interrupted.

The behavior isn't an issue in production usecases, but it's an issue when running tests.

okio.AsyncTimeout.Watchdog is loaded by a Quarkus class loader for every test, so it implicitly keeps a reference to its class loader and transitively to all the resources that one holds - just because the thread's still running.

Setting the following parameters disables timeouts and in turn does not start that watchdog thread and the OOM doesn't happen. It's maybe a legit workaround for tests, until the issue's fixed.

quarkus.otel.exporter.otlp.timeout=0
quarkus.otel.exporter.otlp.traces.timeout=0

This behavior seems to be introduced after Quarkus 3.0.0.Alpha5, but I'm not sure why, because okio seems to behave this way "forever".

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 1 year ago

/cc @brunobat (opentelemetry), @radcortez (opentelemetry)

brunobat commented 1 year ago

Hi @snazy, not sure what kind of tests you are performing that lead to an OOM... I would need to have an example. The recommended config for tests with OTel, if using Quarkus 3 is something like this:

quarkus.otel.bsp.schedule.delay=50
quarkus.otel.bsp.export.timeout=1s

This makes sure everything is flushed in a timely fashion because the defaults are tuned for prod. See: quarkus.otel.bsp.schedule.delay quarkus.otel.bsp.export.timeout

Please let me know if this works for you. I can document this issue.

snazy commented 1 year ago

It's a legit bug caused by the usage of okio/okhttp w/ that watchdog thread in tests.

It's not about flushing, it's that the watchdog thread never terminates and keeps all classes on the heap, eventually producing an OOM.

Can repro at will:

You can also attach a debugger to it and check that the Watchdog thread keeps running and never terminates.

brunobat commented 1 year ago

What are the memory definitions that you use @snazy ?

snazy commented 1 year ago

At max 1.5G heap IIRC, maybe less

brunobat commented 1 year ago

I cannot run the tests. Throws me a too many open files before I reach to any OOM. I'm in a locked mac. No way to increase more of that. Is there a specific test where the OOM will happen more often? (will find a different, slower machine)

snazy commented 1 year ago

There are not that many test classes, OOMs pretty much „at the end“.

brunobat commented 1 year ago

I asked @alesj to please take a look.

alesj commented 1 year ago

I already got this on the main branch (which I ran by mistake ...)

alesj commented 1 year ago

Where as this never stopped: https://gist.github.com/alesj/60d7467f85644dd639869f3c23833818 ... 42min and it would go on ...

@snazy any easier way to reproduce this -- not to have the whole Nessie project ... ?

snazy commented 1 year ago

Probably - you could create a test project with many test classes, and each produces an otel event.

alesj commented 1 year ago

@snazy what if you set this

for the okio library. This way it will only load okio classes once for all tests.

alesj commented 1 year ago

@snazy any progress / luck with this parent-first? Or easier reproducer ...

snazy commented 1 year ago

Sorry, no luck so far. It's pretty clear that it's caused by okio, already known and was hit before. I think, the current idea is to remove okio. It's "only" a test-issue (prod-like doesn't reload stuff).

geoand commented 1 year ago

So I assume the workaround we need is to have our own implementation of io.opentelemetry.exporter.internal.grpc.GrpcExporter which would be used instead of io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter?

brunobat commented 1 year ago

Yes and not just for this reason. We need to drop OkHttp because of support reasons.

geoand commented 1 year ago

Good point

geoand commented 1 year ago

So I assume this is one for @alesj due to the need for gRPC

brunobat commented 1 year ago

Yes, it would be nice

geoand commented 1 year ago

I might be misreading OkHttpGrpcExporter, but it seems to simply be making a HTTP call. Is that correct or am I missing something?

brunobat commented 1 year ago

The OkHttp lib is able to do both GRPC and HTTP, there is a config for that.

geoand commented 1 year ago

Yeah, I see that. My point is that in reality, OkHttpGrpcExporter doesn't really need to know much about gRPC.

And actually this is true, as the description of the original PR that introduced the class above says.

brunobat commented 1 year ago

Correct. Also, beware of this: https://github.com/open-telemetry/opentelemetry-java/pull/5557/ We might not need to implement much.

geoand commented 1 year ago

I saw that too, yeah. But from a quick read through it, doesn't seem to change any gRPC related stuff.

geoand commented 1 year ago

According to this, it's not possible to use the JDK HTTP Client because there is not trailing header support.

@cescoffier @alesj I assume we could leverage the Vert.x GrpcClient or even the HttpClient here as all we need to do is shove some bytes over HTTP2 (see what they are currently doing with OkHttp). WDYT?

brunobat commented 1 year ago

That's actually my preference as well. Before this JDK HTTP Client implementation there was no API to plug a different client. Everything was tight to OkHttp. Now there is one.

cescoffier commented 1 year ago

What do you call trailing headers? trailer? gRPC is often using trailers, and we do support that.

geoand commented 1 year ago

@cescoffier yeah trailer

and we do support that.

so it should be possible to use the pure GrpcClient client from Vert.x to perform the write operation we need?

cescoffier commented 1 year ago

Yes, it should be fine.

geoand commented 1 year ago

Cool, thanks

geoand commented 1 year ago

I assume https://vertx.io/docs/vertx-grpc/java/#_message_level_api_2 is what we are after. I'll try and have a look next week

geoand commented 1 year ago

I have a prototype here. I think it's a pretty good starting point and if you agree I can move it forward, it should definitely be reviewed / improved by Vert.x / gRPC / OTel folks

geoand commented 1 year ago

I got native to work, so now I think it's more a matter of hardening