quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0
13.54k stars 2.61k forks source link

WebSocket + Role-based authentication stopped working with Quarkus 3.9.x: Security Identity is not available #40307

Closed lyrst closed 4 months ago

lyrst commented 4 months ago

Describe the bug

I hava a WebSocket-endpoint that is secured with the "RolesAllowed"-annotation. In order to secure the websocket endpint I followed the approach that is described here https://github.com/quarkusio/quarkus/issues/29919

The issue that the Test testFoo1Endpoint does not even run with a NullPointerException.

As soon as I switch to Quarkus 3.8.4 the tests fails with an assertion-related error.

Expected behavior

The testFoo1Endpoint-tests is able to run without the NullPointerException.

Actual behavior

The testFoo1Endpoint-tests is fails with a NullPointerException.

How to Reproduce?

Run the following example code.

quarkus-sec-issue.zip

Output of uname -a or ver

Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:25 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6030 arm64

Output of java -version

No response

Quarkus version or git rev

3.9.4

Build tool (ie. output of mvnw --version or gradlew --version)

mvn

Additional information

No response

quarkus-bot[bot] commented 4 months ago

/cc @sberyozkin (security)

sberyozkin commented 4 months ago

@lyrst Thanks for the reproducer @mkouba @michalvavrik Are you aware of what might have changed that causes Websockets authenticator and configurator in the attached reproducer failing to provide the identity info with the token ? WebSockets next is an independent project AFAIK so should not have impacted the code which was running in 3.8.4

Thanks

sberyozkin commented 4 months ago

I don't recall us providing any doc-ed recommendations along those lines, so may be we can recommend to switch to WebSockets Next to achieve the expected RBAC ?

mkouba commented 4 months ago

Are you aware of what might have changed that causes Websockets authenticator and configurator in the attached reproducer failing to provide the identity info with the token ?

I have no idea.

WebSockets next is an independent project AFAIK so should not have impacted the code which was running in 3.8.4

Yes, I've created https://github.com/quarkusio/quarkus/issues/40312 to track our security-related efforts for WebSockets Next.

lyrst commented 4 months ago

I tried to use the new WebSocketNext extension as I thought that the RolesAllowed-annotation would work out of the box as it is based on the HttpServer (quote from documentation "The WebSocket handling reuses the main HTTP server.") but this does not seem to work ;-) But I guess I was lacking more information and as @mkouba created a new issue, it seems that more code changes are needed so make "WebSocketNext" work with the @RolesAllowed

mkouba commented 4 months ago

I tried to use the new WebSocketNext extension as I thought that the RolesAllowed-annotation would work out of the box as it is based on the HttpServer (quote from documentation "The WebSocket handling reuses the main HTTP server.") but this does not seem to work ;) But I guess I was lacking more information ;) and as @mkouba created a new issue, it seems that more code changes are needed so make "WebSocketNext" work with the @RolesAllowed

Yes, the integration with the security extension is currently WIP.

sberyozkin commented 4 months ago

Hi Martin, @mkouba, as far as this issue is concerned, can you please have a quick look at the #29919, as you helped there as well, and see if something rings a bell with respect to possible changes to the code referenced in #29919, between 3.8 and 3.9, np if nothing obvious springs to mind

michalvavrik commented 4 months ago

NPE is thrown in Quarkus REST handler, I'll have a look.

michalvavrik commented 4 months ago

I'm afraid this will take a CDI expert, at the very least to give me a hint. I'll write you an email.

mkouba commented 4 months ago

Hm, so if I understand it correctly the failing test (ExampleResourceTest) does not involve a WebSocket connection at all. It's merely testing the JAX-RS resource, right?

michalvavrik commented 4 months ago

Hm, so if I understand it correctly the failing test (ExampleResourceTest) does not involve a WebSocket connection at all. It's merely testing the JAX-RS resource, right?

That's correct. The HTTP request comes into the WebSocket handler, but leaves it as it does not detect the WS header.

michalvavrik commented 4 months ago

When Reactive Routes are used together with Quarkus REST and some @RouteFilter is applied before the Quarkus REST begin processing, CDI request context comes active to the Quarkus REST. However Quarkus REST creates a new CDI request context state with io.quarkus.arc.ManagedContext#activate(). I verified that when invoked on the same thread, the CDI request context state keeps changing. Here are some of my debug logging, not sure if useful for anyone else:

io.quarkus.vertx.runtime.VertxCurrentContextFactory is fallback
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (null) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
TestIdentityAssociation.setIdentity deferred - thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
///// routing context  /hello/foo1
|||||||||||||||| end rt ctx
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
cdi request scope activation in REST thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
TestIdentityAssociation.setIdentity sync - thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
TestIdentityAssociation.getDeferredIdentity thread 124 context state io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7
io.quarkus.vertx.runtime.VertxCurrentContextFactory is duplicate
io.quarkus.vertx.runtime.VertxCurrentContextFactory old state (io.quarkus.arc.impl.RequestContext$RequestContextState@313c23a7) is not a next state (io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c)
io.quarkus.vertx.runtime.VertxCurrentContextFactory setting next state to io.quarkus.arc.impl.RequestContext$RequestContextState@1478249c

Interesting is also why the state has changed, I added printing of a stack trace, however it's too long to put here. See the attachment.

What fixes (or works around?) the issue is for example this: https://github.com/quarkusio/quarkus/commit/bf7880ab3 test-log.txt 65734ebe7aaa265dae724cd0010cd20

I found other ways to work around the issue, but it really requires a feedback from a CDI experts to determine the one that should be used. Use reproducer from the description. Thank you

cc @mkouba @manovotn @Ladicek @geoand

michalvavrik commented 4 months ago

My simplistic understanding is that the context state changes when subscribed Uni is created when the old context state is still in place, even though subscription is happening when the new context state is in place.

manovotn commented 4 months ago

It seems like AbstractResteasyReactiveContext#requireCDIRequestScope() completely omits the case where you already have active req. context that you didn't start up yourself and instead always tries to enforce new, blank instance through ThreadSetupAction#activateInitial. While I do not claim I understand the whole chain happening there (not even close in fact), this seems like an issue.

What fixes (or works around?) the issue is for example this: https://github.com/quarkusio/quarkus/commit/bf7880ab3

Yes, my debugging attempts landed at pretty much the same thing to resolve it - https://github.com/manovotn/quarkus/commit/7923d226a5b1aac0862e7c6055f450842b34f689

That being said, I am still unsure about:

michalvavrik commented 4 months ago

Very nice @manovotn , thank you for looking at it! I might not be a good adept to answer the questions raised, maybe someone else will, but I'll try.

Yes, my debugging attempts landed at pretty much the same thing to resolve it - https://github.com/manovotn/quarkus/commit/7923d226a5b1aac0862e7c6055f450842b34f689

lgtm, but tests will be needed. I can prepare them.

Is it expected that you can get a req, context that's already active? Seems like yes, because in this case, it's Vertx RouteHandler that does the activation and that seems perfectly valid?

I was asking myself a same thing, but considering that Reactive Routes are "finished product" (the way I see it; it's removed from RHBQ at least and recently dropped from SR JWT) I don't want to propose important changes there.

Given that it is permitted scenario, should we have some special handling for its deactivation given that someone else activated it and might want to deactive/destroy later?

My thinking was that we should not accept that users activate CDI request context. If user code or extensions activate CDI request context prior to Quarkus REST processing, they should destroy it as well. This is very early in processing and user code often needs Quarkus REST to prepare the request context for example with current identity and RoutingContext. We know how to handle and detect case when Reactive Routes activate it, but we can't really keep checking if someone destroyed active context during the processing inside Quarkus REST, can we? Wouldn't it require loop constantly checking as it can be async process?

So my thinking was that we allow it for Reactive Routes, but consider it "illegal state" when it comes in other cases. What should be done I'm not sure. Can we destroy the activate request context ourself and activate a new one? It is enough for cases where concrete state is kept, but not enough when someone simply calls requestContext.terminate(). It should be simply illegal to activate request context without termination so early. WDYT @geoand ?

From my observations in this reproducer, the deactivation and destruction of context is handled by Vertx RouteHandler prior to the login implemented in AbstractResteasyReactiveContext which then does no-op in this regard

+1

manovotn commented 4 months ago

lgtm, but tests will be needed. I can prepare them.

That would be awesome, thank you. I didn't mean to propose this as a final fix, just what I came up with while discovering what's going on :)

My thinking was that we should not accept that users activate CDI request context.

There are CDI APIs allowing you to programmatically activate req. context and even more of them in Arc itself. However, I don't think you'd be able to get that far without first activating it from within some other Quarkus part.

... but we can't really keep checking if someone destroyed active context during the processing inside Quarkus REST, can we? Wouldn't it require loop constantly checking as it can be async process?

Yea, if you go on tempering with context activation and destruction forcefully, you are IMO on your own. I don't think we should assume it's being manipulated with; doing so would break user app expectations anyway (i.e. state of existing user defined beans for example).

So my thinking was that we allow it for Reactive Routes, but consider it "illegal state" when it comes in other cases. What should be done I'm not sure. Can we destroy the activate request context ourself and activate a new one? It is enough for cases where concrete state is kept, but not enough when someone simply calls requestContext.terminate(). It should be simply illegal to activate request context without termination so early. WDYT @geoand ?

I'd say it is generally easier to respect if the context is already active when your framework/integration needs it and just reuse that one. Whoever creates the context is also responsible for its destruction. My original question was more along the lines of whether we can be sure that the context that we get here will outlast the window in which we need it to work.

geoand commented 4 months ago

Let me try and summarize so as to check if I have understood the issue correctly:

There is some piece of code that runs in a Vertx route that runs before Quarkus REST and that route activates the CDI context which then Quarkus REST completely ignores and starts a new one? Also, I have not understood from the discussion what has changed between 3.8 and 3.9.

michalvavrik commented 4 months ago

(in short :-)) @manovotn I agree with your conclusions / answers. Thank you.

My original question was more along the lines of whether we can be sure that the context that we get here will outlast the window in which we need it to work.

So far I believe we can in case of Reactive Routes.

There is some piece of code that runs in a Vertx route that runs before Quarkus REST and that route activates the CDI context which then Quarkus REST completely ignores and starts a new one?

yes

Also, I have not understood from the discussion what has changed between 3.8 and 3.9.

I think it comes from refactoring of the EagerSecurityHandler as if you move request context activation few lines above, it works. Problem is that we can't just accept that is the cause. There is a race. It would make for a fragile code that noone can touch. That's why I asked for help from CDI experts to find a root cause.

geoand commented 4 months ago

It's definitely possible to use the existing request scoped state if it exists, but is that really something we want to do in all cases?

manovotn commented 4 months ago

It's definitely possible to use the existing request scoped state if it exists, but is that really something we want to do in all cases?

Can you think of a case where that would be an issue? I would put it the other way and rather say that we shouldn't forcibly destroy previous context and put up a new one. That can easily break expectations of other code (be it user or extension) that activated it.

geoand commented 4 months ago

Can you think of a case where that would be an issue?

Not off hand

geoand commented 4 months ago

What I think can become problematic is how we handle the end of the request processing in Quarkus REST.

Currently as we always start the context, we always terminate it as well. If however now we move to not starting the context, what is the expectation going to be about closing it?

geoand commented 4 months ago

40388 is a draft that reuses the request context which you folks can use to test.

manovotn commented 4 months ago

What I think can become problematic is how we handle the end of the request processing in Quarkus REST.

Currently as we always start the context, we always terminate it as well. If however now we move to not starting the context, what is the expectation going to be about closing it?

That it is handled by whoever started it. Which is the case that I observed in this reproducer as well (see last bit of https://github.com/quarkusio/quarkus/issues/40307#issuecomment-2083249168)

https://github.com/quarkusio/quarkus/pull/40388 is a draft that reuses the request context which you folks can use to test.

+1, this is very similar to what I proposed earlier (https://github.com/manovotn/quarkus/commit/7923d226a5b1aac0862e7c6055f450842b34f689) and to what Michal has (https://github.com/quarkusio/quarkus/commit/bf7880ab3). Although I'd prefer if the activateInitial method actually does unconditional context activation but that's just minor comment :)

geoand commented 4 months ago

That it is handled by whoever started it

Theoretically yes, but is that actually the case in practice. If so, we also need to update the patch to not destroy the context

manovotn commented 4 months ago

That it is handled by whoever started it

Theoretically yes, but is that actually the case in practice. If so, we also need to update the patch to not destroy the context

A repeated invocation of context destruction is a no-op (as is the case in the reproducer). I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it. We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place. I could take a look tomorrow as I am not in front of my laptop now (bank holiday here).

geoand commented 4 months ago

Right, that's my point.

I'll update my draft PR soon

geoand commented 4 months ago

I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it. We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place

I didn't really find a way to do that, because I don't see how I can attach anything to the RequestContextState

manovotn commented 4 months ago

I see your point though; we could instead alter it so that if we detect an active context, we don't even attempt to destroy it. We should be able to do that through the requestScopeActivated boolean attribute with some additional checks in place

I didn't really find a way to do that, because I don't see how I can attach anything to the RequestContextState

@geoand you cannot attach anything to it. We should be able to use the property inside AbstractResteasyReactiveContext to track whether the context originated here or elsewhere and perform cleanup conditionally. Here's what I mean - https://github.com/quarkusio/quarkus/pull/40408

geoand commented 4 months ago

We can do that I guess