Closed snazy closed 1 year ago
/cc @brunobat (opentelemetry), @radcortez (opentelemetry)
Hi @snazy, not sure what kind of tests you are performing that lead to an OOM... I would need to have an example. The recommended config for tests with OTel, if using Quarkus 3 is something like this:
quarkus.otel.bsp.schedule.delay=50
quarkus.otel.bsp.export.timeout=1s
This makes sure everything is flushed in a timely fashion because the defaults are tuned for prod. See: quarkus.otel.bsp.schedule.delay quarkus.otel.bsp.export.timeout
Please let me know if this works for you. I can document this issue.
It's a legit bug caused by the usage of okio/okhttp w/ that watchdog thread in tests.
It's not about flushing, it's that the watchdog thread never terminates and keeps all classes on the heap, eventually producing an OOM.
Can repro at will:
./gradlew :nessie-quarkus:test
with the settings in servers/quarkus-server/src/main/resources/application.properties
(end of the file) removedYou can also attach a debugger to it and check that the Watchdog thread keeps running and never terminates.
What are the memory definitions that you use @snazy ?
At max 1.5G heap IIRC, maybe less
I cannot run the tests. Throws me a too many open files before I reach to any OOM. I'm in a locked mac. No way to increase more of that. Is there a specific test where the OOM will happen more often? (will find a different, slower machine)
There are not that many test classes, OOMs pretty much „at the end“.
I asked @alesj to please take a look.
I already got this on the main
branch (which I ran by mistake ...)
Where as this never stopped: https://gist.github.com/alesj/60d7467f85644dd639869f3c23833818 ... 42min and it would go on ...
@snazy any easier way to reproduce this -- not to have the whole Nessie project ... ?
Probably - you could create a test project with many test classes, and each produces an otel event.
@snazy what if you set this
for the okio
library.
This way it will only load okio
classes once for all tests.
@snazy any progress / luck with this parent-first
?
Or easier reproducer ...
Sorry, no luck so far. It's pretty clear that it's caused by okio, already known and was hit before. I think, the current idea is to remove okio. It's "only" a test-issue (prod-like doesn't reload stuff).
So I assume the workaround we need is to have our own implementation of io.opentelemetry.exporter.internal.grpc.GrpcExporter
which would be used instead of io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter
?
Yes and not just for this reason. We need to drop OkHttp because of support reasons.
Good point
So I assume this is one for @alesj due to the need for gRPC
Yes, it would be nice
I might be misreading OkHttpGrpcExporter
, but it seems to simply be making a HTTP call. Is that correct or am I missing something?
The OkHttp lib is able to do both GRPC and HTTP, there is a config for that.
Yeah, I see that. My point is that in reality, OkHttpGrpcExporter
doesn't really need to know much about gRPC.
And actually this is true, as the description of the original PR that introduced the class above says.
Correct. Also, beware of this: https://github.com/open-telemetry/opentelemetry-java/pull/5557/ We might not need to implement much.
I saw that too, yeah. But from a quick read through it, doesn't seem to change any gRPC related stuff.
According to this, it's not possible to use the JDK HTTP Client because there is not trailing header support.
@cescoffier @alesj I assume we could leverage the Vert.x GrpcClient
or even the HttpClient
here as all we need to do is shove some bytes over HTTP2 (see what they are currently doing with OkHttp). WDYT?
That's actually my preference as well. Before this JDK HTTP Client implementation there was no API to plug a different client. Everything was tight to OkHttp. Now there is one.
What do you call trailing headers? trailer? gRPC is often using trailers, and we do support that.
@cescoffier yeah trailer
and we do support that.
so it should be possible to use the pure GrpcClient
client from Vert.x to perform the write operation we need?
Yes, it should be fine.
Cool, thanks
I assume https://vertx.io/docs/vertx-grpc/java/#_message_level_api_2 is what we are after. I'll try and have a look next week
I have a prototype here. I think it's a pretty good starting point and if you agree I can move it forward, it should definitely be reviewed / improved by Vert.x / gRPC / OTel folks
I got native to work, so now I think it's more a matter of hardening
Describe the bug
okio starts a thread via the class
okio.AsyncTimeout.Watchdog
once "it has to deal with a timeout". The thread's configured as a daemon thread and seems to have some shutdown logic, and also aborts when its interrupted.The behavior isn't an issue in production usecases, but it's an issue when running tests.
okio.AsyncTimeout.Watchdog
is loaded by a Quarkus class loader for every test, so it implicitly keeps a reference to its class loader and transitively to all the resources that one holds - just because the thread's still running.Setting the following parameters disables timeouts and in turn does not start that watchdog thread and the OOM doesn't happen. It's maybe a legit workaround for tests, until the issue's fixed.
This behavior seems to be introduced after Quarkus 3.0.0.Alpha5, but I'm not sure why, because okio seems to behave this way "forever".
Expected behavior
No response
Actual behavior
No response
How to Reproduce?
No response
Output of
uname -a
orver
No response
Output of
java -version
No response
GraalVM version (if different from Java)
No response
Quarkus version or git rev
No response
Build tool (ie. output of
mvnw --version
orgradlew --version
)No response
Additional information
No response