kafbat / kafka-ui

Open-Source Web UI for managing Apache Kafka clusters
http://ui.docs.kafbat.io
Apache License 2.0

SSO occasional Connection reset error #469

Open ognjenVlad opened 4 months ago

ognjenVlad commented 4 months ago


Describe the bug (actual behavior)

When using SSO, the web UI sometimes returns a 500. It is hard to pin down exactly when, but it seems to happen when coming back to the web UI after some time away, although sometimes it also happens at random when you simply try to log in. The failing request is always the redirect to the redirect-uri, login/oauth2/code/{client}.

{"code":5000,"message":"Connection reset","timestamp":1720450398132,"requestId":"e25340f5-656","fieldsErrors":null,"stackTrace":"org.springframework.web.reactive.function.client.WebClientRequestException: Connection reset\n\tat org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\tSuppressed: The stacktrace has been enhanced by Reactor, refer to additional information below: \nError has been observed at the following site(s):\n\t*__checkpoint ⇢ Request to POST https://***/token [DefaultWebClient]\n\t*__checkpoint ⇢ OAuth2LoginAuthenticationWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ OAuth2AuthorizationRequestRedirectWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ ReactorContextWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ HttpHeaderWriterWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ ServerWebExchangeReactorContextWebFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ org.springframework.security.web.server.WebFilterChainProxy [DefaultWebFilterChain]\n\t*__checkpoint ⇢ org.springframework.web.filter.reactive.ServerHttpObservationFilter [DefaultWebFilterChain]\n\t*__checkpoint ⇢ HTTP GET \"/login/oauth2/code/test?code=***\" [ExceptionHandlingWebHandler]\nOriginal Stack Trace:\n\t\tat org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136)\n\t\tat reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)\n\t\tat reactor.core.publisher.Mono.subscribe(Mono.java:4495)\n\t\tat reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:103)\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\tat reactor.core.publisher.FluxPeek$PeekSubscriber.onError(FluxPeek.java:222)\n\t\tat 
reactor.core.publisher.MonoNext$NextSubscriber.onError(MonoNext.java:93)\n\t\tat reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onError(MonoFlatMapMany.java:204)\n\t\tat reactor.core.publisher.SerializedSubscriber.onError(SerializedSubscriber.java:124)\n\t\tat reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.whenError(FluxRetryWhen.java:225)\n\t\tat reactor.core.publisher.FluxRetryWhen$RetryWhenOtherSubscriber.onError(FluxRetryWhen.java:274)\n\t\tat reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onError(FluxContextWrite.java:121)\n\t\tat reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.maybeOnError(FluxConcatMapNoPrefetch.java:326)\n\t\tat reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.onNext(FluxConcatMapNoPrefetch.java:211)\n\t\tat reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)\n\t\tat reactor.core.publisher.SinkManyEmitterProcessor.drain(SinkManyEmitterProcessor.java:471)\n\t\tat reactor.core.publisher.SinkManyEmitterProcessor$EmitterInner.drainParent(SinkManyEmitterProcessor.java:615)\n\t\tat reactor.core.publisher.FluxPublish$PubSubInner.request(FluxPublish.java:873)\n\t\tat reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.request(FluxContextWrite.java:136)\n\t\tat reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.request(FluxConcatMapNoPrefetch.java:336)\n\t\tat reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.request(FluxContextWrite.java:136)\n\t\tat reactor.core.publisher.Operators$DeferredSubscription.request(Operators.java:1717)\n\t\tat reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onError(FluxRetryWhen.java:192)\n\t\tat reactor.core.publisher.MonoCreate$DefaultMonoSink.error(MonoCreate.java:201)\n\t\tat reactor.netty.http.client.HttpClientConnect$HttpObserver.onUncaughtException(HttpClientConnect.java:403)\n\t\tat 
reactor.netty.ReactorNetty$CompositeConnectionObserver.onUncaughtException(ReactorNetty.java:703)\n\t\tat reactor.netty.resources.DefaultPooledConnectionProvider$DisposableAcquire.onUncaughtException(DefaultPooledConnectionProvider.java:223)\n\t\tat reactor.netty.resources.DefaultPooledConnectionProvider$PooledConnection.onUncaughtException(DefaultPooledConnectionProvider.java:476)\n\t\tat reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:247)\n\t\tat reactor.netty.channel.FluxReceive.onInboundError(FluxReceive.java:468)\n\t\tat reactor.netty.channel.ChannelOperations.onInboundError(ChannelOperations.java:515)\n\t\tat reactor.netty.channel.ChannelOperationsHandler.exceptionCaught(ChannelOperationsHandler.java:145)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)\n\t\tat io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)\n\t\tat io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)\n\t\tat io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)\n\t\tat io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)\n\t\tat io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\t\tat 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)\n\t\tat io.netty.handler.ssl.SslHandler.exceptionCaught(SslHandler.java:1204)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)\n\t\tat io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1377)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\t\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\t\tat io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:907)\n\t\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125)\n\t\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:177)\n\t\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\t\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\t\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\t\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\t\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\t\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\t\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\t\tat java.base/java.lang.Thread.run(Thread.java:840)\nCaused by: java.net.SocketException: Connection reset\n\tat 
java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)\n\tat java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)\n\tat io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:255)\n\tat io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)\n\tat io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n"}

Expected behavior

Successful login every time we use SSO

Your installation details

  1. 4de0d53
  2. `kafka: clusters:

    • name: test bootstrapServers: "" logging: level: "debug" io.kafbat.ui: DEBUG org.springframework.http.codec.json.Jackson2JsonEncoder: DEBUG org.springframework.http.codec.json.Jackson2JsonDecoder: DEBUG reactor.netty.http.server.AccessLog: DEBUG org.springframework.security: DEBUG auth: type: OAUTH2 oauth2: client: test: clientId: "" clientSecret: "" scope: ["openid", "email", "groups"] client-name: github authorization-grant-type: authorization_code authorization-uri: https:///auth redirect-uri: https://***/login/oauth2/code/test provider: "" user-name-attribute: email token-uri: https:///token issuer-uri: https://***/ custom-params: type: oauth roles-field: groups rbac: roles:`

Steps to reproduce

Since it happens so randomly, it is hard to provide steps to reproduce. It happens occasionally when using SSO, both with a custom OIDC provider and with GitHub. I tried older versions as well, and it happened with every version.

Screenshots

No response

Logs

No response

Additional context

No response

github-actions[bot] commented 4 months ago

Hi ognjenVlad! 👋

Welcome, and thank you for opening your first issue in the repo!

Please wait for triaging by our maintainers.

As development is carried out in our spare time, you can support us by sponsoring our activities or even funding the development of specific issues.

If you plan to raise a PR for this issue, please take a look at our contributing guide.

Haarolean commented 4 months ago

which oidc/oauth provider do you use?

ognjenVlad commented 4 months ago

Dex (https://github.com/dexidp/dex), which doesn't log any errors; authentication is successful. The 500 Connection reset only occurs on the kafbat side.

Haarolean commented 4 months ago

Could you please try adding this env var SERVER_REACTIVE_SESSION_TIMEOUT: "86400" and observing if there are any changes?

ognjenVlad commented 4 months ago

It is still happening, but it seems harder to reproduce, if that makes sense. Thanks

Haarolean commented 3 months ago

We'd need a minimal reproducible example to be able to reproduce and fix this (if there's anything to fix). Feel free to use our keycloak setup example if needed.

kapybro[bot] commented 3 months ago

Further user feedback is requested. Please reply within 7 days or we might close the issue.

kapybro[bot] commented 3 months ago

No feedback received within 7 days. Auto closing.

sashaozz commented 2 months ago

It looks like your authorization server might be behind a load balancer that silently drops idle connections after a timeout, for example an AWS NLB, which has an idle timeout of 350 seconds. This is a known problem (e.g. see https://github.com/reactor/reactor-netty/issues/1774) for clients that pool TCP connections, such as netty, which kafka-ui uses: the pool hands out a connection that the load balancer has already dropped, and the next request on it fails with Connection reset.
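
As a rough illustration of that failure mode, here is a minimal Java model. It is not reactor-netty code: the class and method names are hypothetical, and `-1` meaning "no limit" is simply this sketch's own convention. It only shows why a pool idle limit shorter than the load balancer's timeout keeps a dropped connection from being reused.

```java
public class IdlePoolModel {
    // Sketch-only convention: -1 means the pool never evicts idle connections.
    public static final long NO_LIMIT = -1;

    // The load balancer silently drops a connection once it has been idle
    // for at least its idle timeout.
    public static boolean droppedByLoadBalancer(long idleMs, long lbIdleTimeoutMs) {
        return idleMs >= lbIdleTimeoutMs;
    }

    // The pool evicts a connection once it has been idle for at least its
    // own max idle time, unless no limit is configured.
    public static boolean evictedByPool(long idleMs, long poolMaxIdleMs) {
        return poolMaxIdleMs != NO_LIMIT && idleMs >= poolMaxIdleMs;
    }

    // True when the pool would hand out a connection that the load balancer
    // has already dropped, which is the case that ends in "Connection reset".
    public static boolean reusesDeadConnection(long idleMs, long lbIdleTimeoutMs, long poolMaxIdleMs) {
        return droppedByLoadBalancer(idleMs, lbIdleTimeoutMs)
                && !evictedByPool(idleMs, poolMaxIdleMs);
    }

    public static void main(String[] args) {
        long lbIdleTimeoutMs = 350_000; // 350 s, the NLB figure quoted above
        long idleMs = 600_000;          // user comes back after 10 minutes

        // No pool idle limit: the dead connection is reused.
        System.out.println(reusesDeadConnection(idleMs, lbIdleTimeoutMs, NO_LIMIT)); // true
        // 30 s pool idle limit: the connection was evicted long before.
        System.out.println(reusesDeadConnection(idleMs, lbIdleTimeoutMs, 30_000));   // false
    }
}
```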

In this case you can try to fix it by instructing netty to evict pooled connections before the load balancer's timeout, e.g. by setting the env var JAVA_OPTS: -Dreactor.netty.pool.maxIdleTime=30000 -Dreactor.netty.pool.maxLifeTime=60000 (both values are in milliseconds).
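
For a container deployment, the setting might look like the fragment below. This is only a sketch: it assumes Docker Compose, and the service name and image reference are assumptions rather than details from this thread. Only the JAVA_OPTS value comes from the suggestion above.

```yaml
services:
  kafka-ui:                                  # hypothetical service name
    image: ghcr.io/kafbat/kafka-ui:latest    # assumed image name and tag
    environment:
      # Evict pooled connections idle for 30 s and retire any connection
      # after 60 s, both well under a 350 s load balancer idle timeout.
      JAVA_OPTS: "-Dreactor.netty.pool.maxIdleTime=30000 -Dreactor.netty.pool.maxLifeTime=60000"
```

The exact values matter less than keeping both below the idle timeout of whatever sits between kafka-ui and the authorization server.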

ognjenVlad commented 1 month ago

This actually fixed it! @sashaozz Thank you very much