quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.54k stars 2.61k forks source link

OIDC traces not sent properly at startup #39644

Open goldmann opened 5 months ago

goldmann commented 5 months ago

Describe the bug

When the oidc extension is used together with otel extension I see warning in the logs which suggest that the traces from oidc are not published to the collector.

INFO exec -a "java" java -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager -cp "." -jar /deployments/quarkus-run.jar 
INFO running in /deployments
Picked up JAVA_TOOL_OPTIONS: -XX:MaxRAMPercentage=70.0 -XX:+ExitOnOutOfMemoryError
 .d8888b.  888888b.    .d88888b.  888b     d888                  
d88P  Y88b 888  "88b  d88P" "Y88b 8888b   d8888                  
Y88b.      888  .88P  888     888 88888b.d88888                  
 "Y888b.   8888888K.  888     888 888Y88888P888  .d88b.  888d888 
    "Y88b. 888  "Y88b 888     888 888 Y888P 888 d8P  Y8b 888P"   
      "888 888    888 888     888 888  Y8P  888 88888888 888     
Y88b  d88P 888   d88P Y88b. .d88P 888   "   888 Y8b.     888     
 "Y8888P"  8888888P"   "Y88888P"  888       888  "Y8888  888     

                                         Powered by Quarkus 3.7.2
08:47:29,820 INFO  traceId=, parentId=, spanId=, sampled= [io.qua.oid.com.run.OidcCommonUtils] (main) Connecting to OpenId Connect Provider for up to 15 times every 2 seconds mdc:[{}]
08:47:31,818 WARN  traceId=, parentId=, spanId=, sampled= [io.qua.ope.run.tra.int.ver.OpenTelemetryVertxTracingFactory$VertxDelegator] (vert.x-eventloop-thread-1) VertxTracer delegate not set. Will not submit this trace. SpanKind: RPC; Request: io.vertx.core.http.impl.HttpRequestHead@1d185e90; Operation:GET. mdc:[{}]
08:47:31,919 WARN  traceId=, parentId=, spanId=, sampled= [io.qua.ope.run.tra.int.ver.OpenTelemetryVertxTracingFactory$VertxDelegator] (vert.x-eventloop-thread-1) VertxTracer delegate not set. Will not submit this trace. Response: io.vertx.core.http.impl.HttpResponseHead@3b7d824f; Failure: null. mdc:[{}]
08:47:32,035 WARN  traceId=, parentId=, spanId=, sampled= [io.qua.ope.run.tra.int.ver.OpenTelemetryVertxTracingFactory$VertxDelegator] (vert.x-eventloop-thread-1) VertxTracer delegate not set. Will not submit this trace. SpanKind: RPC; Request: io.vertx.core.http.impl.HttpRequestHead@60dfc063; Operation:GET. mdc:[{}]
08:47:32,117 WARN  traceId=, parentId=, spanId=, sampled= [io.qua.ope.run.tra.int.ver.OpenTelemetryVertxTracingFactory$VertxDelegator] (vert.x-eventloop-thread-1) VertxTracer delegate not set. Will not submit this trace. Response: io.vertx.core.http.impl.HttpResponseHead@6a86bd1f; Failure: null. mdc:[{}]
08:47:36,525 INFO  traceId=, parentId=, spanId=, sampled= [io.qua.sch.run.SimpleScheduler] (main) No scheduled business methods found - Simple scheduler will not be started mdc:[{}]
08:47:36,631 INFO  traceId=, parentId=, spanId=, sampled= [org.jbo.sbo.ser.fea.sbo.fea.umb.AmqpUmbClientOptionProducer] (main) Setting up AMQP client options mdc:[{}]
...

I've checked the logs in debug mode and just before these warning a connection is established to the auth server.

Potentially this is an OIDC extension issue, which should be setup just a little bit later, after the OTEL one can process traces.

This is related to https://github.com/quarkusio/quarkus/issues/37718

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

No response

Quarkus version or git rev

3.7.2

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

quarkus-bot[bot] commented 5 months ago

/cc @brunobat (opentelemetry,tracing), @pedroigor (oidc), @radcortez (opentelemetry,tracing), @sberyozkin (oidc)

sberyozkin commented 5 months ago

CC @michalvavrik

geoand commented 5 months ago

Do the traces get sent correctly when OIDC is not part of the picture?

geoand commented 5 months ago

Also, can you try Quarkus 3.8 and 3.9 please?

brunobat commented 5 months ago

Please mind this is just a startup issue as discussed here: https://github.com/quarkusio/quarkus/issues/37718#issuecomment-2014570612

OIDC seems to be connecting out before Quarkus startup has finished.

sberyozkin commented 5 months ago

Hi @brunobat, Is it an OIDC specific issue, or, can for example, OpenTel capture events related to the Kafka client or Hibernate establishing its connections ? OIDC initialization has to start when it is given an initialized Vert.x instance, from the OIDC perspective the Quarkus is ready at this point

sberyozkin commented 5 months ago

I don't think OIDC can be specifically aware of OTel, it has to work with or without OTel, so the question is, how OIDC or indeed other extensions creating some connections can know all is ready, with and without OTel. Should OTel queue the events until it is ready ?

brunobat commented 5 months ago

As discussed earlier in the other PR, the OTel extension needs vert.x as well, and can only be instantiated after vert.x is up. All communications happening until OTel is ready will not generate any spans and in the case of vert.x, there is a log notifying these "lost" spans. The idea of caching unsent spans until OTel is ready was floated, however creating spans without the SDK is not likely and artificially feeding them later will mess up their timestamps anyway, which is arguably even worse. I imagine there is a lot of stuff stating after vert.x is up, not just OTel, I wonder if it's wise to issue requests before the end of the boot process. Probably even before the health check is up...

I guess this is a broader question, not just for OIDC.

geoand commented 5 months ago

So IIUC, the problem here is that OIDC is making a request once Vertx is setup but before OTel is?

If so, I think this situation is perfectly legal and we should just make sure that it does not result in bogus warnings

michalvavrik commented 5 months ago

I didn't plan to comment because without reproducer I'm not sure I can debug this but last comment surprised me.

So IIUC, the problem here is that OIDC is making a request once Vertx is setup but before OTel is?

If so, I think this situation is perfectly legal and we should just make sure that it does not result in bogus warnings

If so, why can't we just enforce order between OIDC and OTel like we did between OTel and Agroal with io.quarkus.agroal.spi.OpenTelemetryInitBuildItem ? Or maybe that's what you meant, sorry, I am not sure.

brunobat commented 5 months ago

I think that can work, yes

geoand commented 5 months ago

If we really want OIDC to start after OTel (I can't comment on whether that's a valid goal), then of course that is doable (I assume fairly easily)

michalvavrik commented 5 months ago

If we really want OIDC to start after OTel (I can't comment on whether that's a valid goal), then of course that is doable (I assume fairly easily)

Ok, thanks for the explanation. I don't see why not, what is disadvantage as long as both happens before application starts to receive requests (e.g. before io.quarkus.deployment.builditem.ApplicationStartBuildItem), but maybe Sergey will know. I can't comment on that either.

sberyozkin commented 5 months ago

I'm not sure of the side-effects. Does it make sense to all of Quarkus extensions which may want to report something at the startup ?

brunobat commented 5 months ago

Yes, it makes sense to delay their instantiation, because otherwise we will not be able to track those requests reliably.

geoand commented 5 months ago

I also am not sure which calls should be traced... What I definitely don't think makes sense is to have all outgoing connections be made only after OTel starts. To me it makes sense to decide on a case per case basis.

sberyozkin commented 5 months ago

@geoand Right, there is no any specific use case related to the OIDC start up, compared to any other extension which is attempting to set up a remote connection with the remote server being temporarily unavailable.

sberyozkin commented 5 months ago

What Michal did, was to report the connection failure event if the OIDC server was not ready, and then report that it was re-established at runtime during the first request. This is certainly a useful kind of info to report. But it also a general type of report that can be of interest not only for OIDC (for HashiCorp, for any extension which needs to set up a connection for its work).

May be we can introduce a build time property, which if enabled, would require OIDC, or any other extension which would like to do a similar kind of report, to wait until OTel is available as opposed to introducing extension specific build items one by one and do it non-optionally ?

brunobat commented 5 months ago

I think the configuration is not necessary. If an error can happen on that call and it's relevant, it seems to me that it must be traced. If there is no impediment to delay the startup of OIDC until after OTel, if OTel is present, this should be done.

sberyozkin commented 5 months ago

@brunobat Hey, I agree it should be done, I'm wondering though if it can be treated in a more generic way, as opposed to per-extension specific delaying of the startup. As I said, a connection failure at the startup can happen not only in the case of OIDC. I'm also not 100% sure, delaying the OIDC startup so that it can report events which can never happen (connection failure, as opposed to for ex the authentication success/failure) should be done by default

michalvavrik commented 5 months ago

@brunobat Hey, I agree it should be done, I'm wondering though if it can be treated in a more generic way, as opposed to per-extension specific delaying of the startup.

FWIW (please correct me where I am wrong): If OpenTelemetry extension is not ready to attach data / to receive the request data, it can't just keep this data till it becomes ready because the data are not finite. They need to be accessible / adjustable, we can't just reply steps when the OTel becomes ready because what happens depends on a state stored in the context.

As I said, a connection failure at the startup can happen not only in the case of OIDC. I'm also not 100% sure, delaying the OIDC startup so that it can report events which can never happen (connection failure, as opposed to for ex the authentication success/failure) should be done by default

OpenTelemtry as well as OIDC will always be ready for authentication success/failure because application only receives incoming requests when all the extensions are ready. We are only discussing outgoing requests during application startup and that can either be ordered (so that we can put OTel before OIDC) or not. Personally I don't see issue with starting OTel before OIDC if there is demand.

What I think should be decided is whether these data between auth server and Quarkus should be tracked, because to my knowledge it is only tracked because @sberyozkin decided to use Vert.x web client (+1), but there was never actual decision to provide traces for this (that happens through Vert.x).

I can't answer the last paragraph.

michalvavrik commented 2 months ago

I didn't test it, but I think I have moved related code to a separate build steps in the https://github.com/quarkusio/quarkus/pull/41571. If we really want to fix this, all that needs to be done is to introduce build-time SPI and add @Consume on these buildsteps.

Neither OIDC or OTel has build-time SPI. There doesn't seem to be agreement if it is OIDC specific or whether there should be OTel SPI.