knative / eventing

Event-driven application platform for Kubernetes
https://knative.dev/docs/eventing
Apache License 2.0

Event POST on URL of MT channel based Broker is timing out #6833

Closed Sidhardha95 closed 1 year ago

Sidhardha95 commented 1 year ago

Describe the bug: In one of our clusters, when we post events to the MT channel based Broker URL, the requests time out.

Expected behavior: The event is successfully posted and accepted by the Broker.

Knative release version 1.6.0

Additional context

  1. We are using the MT channel based Broker, with a Kafka Channel as the underlying channel implementation

Broker setup:

apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  annotations:
    eventing.knative.dev/broker.class: MTChannelBasedBroker
    eventing.knative.dev/creator: system:serviceaccount:dx-system:fabric-operator
    eventing.knative.dev/lastModifier: system:serviceaccount:dx-system:fabric-operator
  creationTimestamp: "2023-02-25T03:05:28Z"
  generation: 1
  labels:
    dx4c.managed: "true"
    eventing.knative.dev/broker.class: MTChannelBasedBroker
  name: default
  namespace: dx-test
  resourceVersion: "117749284"
  uid: 2bf414ec-f72c-46b0-80f1-5e8cdd6a4804
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: config-kafka
    namespace: knative-eventing
  delivery:
    backoffDelay: PT15S
    backoffPolicy: linear
    retry: 28800
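
For reference, a minimal sketch of the kind of POST this Broker receives, run from a pod inside the cluster (the ingress URL is cluster-local; it is the same address that appears in the client logs below). The event id, type, source, and payload here are made-up placeholders, not our real values:

curl -v "http://broker-ingress.knative-eventing.svc.cluster.local/dx-test/default" \
  -X POST \
  -H "Ce-Id: test-0001" \
  -H "Ce-Specversion: 1.0" \
  -H "Ce-Type: dev.example.test" \
  -H "Ce-Source: debug/curl" \
  -H "Content-Type: application/json" \
  -d '{"msg": "hello"}'

A healthy ingress normally answers such a request with a 2xx (typically 202 Accepted) within a few seconds; in the failing cluster the client instead waits until its own read timeout fires, as shown below.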

The error in our logs suggests a read timeout after waiting for 10 minutes:

2023-03-27 07:08:47.846+0000 | POST | :Apache-HttpClient/4.5.13 (Java/11.0.18.0.1):dx-test:FAB-8a01371d-126e-4303-8d63-dc960eee675f | nioEventLoopGroup-4- |  |  INFO | com.oracle.cx.communications.dx4c.fabric.notification.NotificationClient | Before Post to broker [http://broker-ingress.knative-eventing.svc.cluster.local/dx-test/default].
2023-03-27 07:09:05.023+0000 |  | ::: |     helidon-client-0 | SYS | ERROR | com.oracle.cx.communications.dx4c.fabric.utils.core.error.ApplicationException | Application Exception 
com.oracle.cx.communications.dx4c.fabric.utils.core.error.ApplicationUntrackedException: Server encountered an internal error
        at com.oracle.cx.communications.dx4c.fabric.notification.NotificationClient.lambda$sendToBroker$2(NotificationClient.java:94)
        at io.helidon.common.reactive.CompletionAwaitable.lambda$exceptionallyAccept$37(CompletionAwaitable.java:311)
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at io.helidon.common.reactive.SingleToFuture.onError(SingleToFuture.java:74)
        at io.helidon.common.reactive.SingleFromPublisher$SingleSubscriber.onError(SingleFromPublisher.java:75)
        at io.helidon.webclient.WebClientRequestBuilderImpl$1.lambda$onError$2(WebClientRequestBuilderImpl.java:642)
        at io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
        at io.helidon.webclient.WebClientRequestBuilderImpl$1.onError(WebClientRequestBuilderImpl.java:642)
        at io.helidon.common.reactive.DeferredScalarSubscription.error(DeferredScalarSubscription.java:132)
        at io.helidon.common.reactive.MultiFromCompletionStage$CompletionStageSubscription.accept(MultiFromCompletionStage.java:70)
        at io.helidon.common.reactive.MultiFromCompletionStage$CompletionStageSubscription.accept(MultiFromCompletionStage.java:52)
        at io.helidon.common.reactive.MultiFromCompletionStage$AtomicBiConsumer.accept(MultiFromCompletionStage.java:95)
        at io.helidon.common.reactive.MultiFromCompletionStage$AtomicBiConsumer.accept(MultiFromCompletionStage.java:88)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at io.helidon.webclient.NettyClientHandler.exceptionCaught(NettyClientHandler.java:272)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
        at io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:92)
        at io.netty.channel.CombinedChannelDuplexHandler$1.fireExceptionCaught(CombinedChannelDuplexHandler.java:145)
        at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
        at io.netty.channel.CombinedChannelDuplexHandler.exceptionCaught(CombinedChannelDuplexHandler.java:231)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at io.netty.handler.logging.LoggingHandler.exceptionCaught(LoggingHandler.java:214)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
        at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
        at io.helidon.webclient.HelidonReadTimeoutHandler.readTimedOut(HelidonReadTimeoutHandler.java:35)
        at io.netty.handler.timeout.ReadTimeoutHandler.channelIdle(ReadTimeoutHandler.java:90)
        at io.netty.handler.timeout.IdleStateHandler$ReaderIdleTimeoutTask.run(IdleStateHandler.java:506)
        at io.netty.handler.timeout.IdleStateHandler$AbstractIdleTask.run(IdleStateHandler.java:478)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
        at io.helidon.common.context.ContextAwareExecutorImpl.lambda$wrap$7(ContextAwareExecutorImpl.java:154)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(Thread.java:834)
        at com.oracle.svm.core.thread.PlatformThreads.threadStartRoutine(PlatformThreads.java:705)
        at com.oracle.svm.core.posix.thread.PosixPlatformThreads.pthreadStartRoutine(PosixPlatformThreads.java:202)
Caused by: java.util.concurrent.CompletionException: io.helidon.webclient.HelidonReadTimeoutHandler$ReadTimeoutException: Read timeout after 600000 millis on socket /10.244.0.93:47586
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
        ... 37 more
Caused by: io.helidon.webclient.HelidonReadTimeoutHandler$ReadTimeoutException: Read timeout after 600000 millis on socket /10.244.0.93:47586
        at io.helidon.webclient.HelidonReadTimeoutHandler.readTimedOut(HelidonReadTimeoutHandler.java:37)
        ... 18 more
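
One way we try to narrow down whether the hang is on the ingress side or on the channel side is to check that the Broker and its underlying channel report Ready (standard kubectl queries; resource names taken from the Broker spec above):

kubectl get broker default -n dx-test
kubectl get kafkachannel -n dx-test
kubectl get trigger -n dx-test

If the underlying KafkaChannel is not Ready, the ingress may be unable to forward events, which could show up on the client as a request that never completes.
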
Sidhardha95 commented 1 year ago

Attaching logs from the mt-broker-ingress pod:

mt-broker-ingress_03Apr23.log

There are continuous warning messages in the logs saying "invalid request method". Can you please help us debug the issue?
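
For reference, this is roughly how we pull those warnings out of the ingress (deployment name as listed in the pod output in the next comment; the grep pattern is simply the phrase we see in the log):

kubectl logs -n knative-eventing deployment/mt-broker-ingress --since=1h | grep -i "invalid request method"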

Sidhardha95 commented 1 year ago

(dx4c_control_env) [opc@cxif-podd-eu-zurich-1-57bc4d8554-88b65 esattestcaccterzs]$ kubectl get po -n knative-eventing
NAME                                    READY   STATUS    RESTARTS   AGE
eventing-controller-8bbbc57cd-vrl82     1/1     Running   0          54d
eventing-webhook-c45b466bb-dtdpd        1/1     Running   0          54d
kafka-ch-controller-5f5bc4bc46-2qf66    1/1     Running   0          54d
kafka-ch-dispatcher-65bcf8b49f-l2nnq    1/1     Running   0          54d
kafka-webhook-64496566b4-9vdhk          1/1     Running   0          54d
mt-broker-controller-597ffcb5c7-66jt9   1/1     Running   0          54d
mt-broker-filter-77ff4b6b76-ftvbj       1/1     Running   0          54d
mt-broker-ingress-658bcd8488-nmjc6      1/1     Running   0          54d
pingsource-mt-adapter-89849b49d-j5w8x   1/1     Running   0          54d

State of the Knative Eventing pods in the reported cluster.
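
If more logs would help, a rough sketch of what we can collect from this cluster, assuming the Deployments carry the same names as the pod prefixes above:

kubectl logs -n knative-eventing deployment/mt-broker-ingress > mt-broker-ingress.log
kubectl logs -n knative-eventing deployment/kafka-ch-dispatcher > kafka-ch-dispatcher.log
kubectl logs -n knative-eventing deployment/mt-broker-filter > mt-broker-filter.log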

pierDipi commented 1 year ago

Does it always happen, or only sometimes?

Sidhardha95 commented 1 year ago

We have many different clusters with the same kind of setup, and it is not happening in all of them.

But once the issue starts happening, we keep seeing similar errors in broker-ingress.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.