knative-extensions / eventing-kafka

Kafka integrations with Knative Eventing.
Apache License 2.0

Dispatcher Unready/Events being Dropped/Size Limit Not being Respected #1473

Closed nkreiger closed 6 months ago

nkreiger commented 6 months ago

I'm having a multitude of issues after upgrading to the latest version, 1.3.8.

Describe the bug

kafka-broker-dispatcher-8694446b46-gxwfc   0/1     Running                  0          8m36s
kafka-broker-dispatcher-8694446b46-l8v76   0/1     ContainerStatusUnknown   1          43m
kafka-broker-dispatcher-8694446b46-ll9xd   0/1     ContainerStatusUnknown   1          43m
kafka-broker-dispatcher-8694446b46-qmvb2   1/1     Running                  0          4m26s
kafka-broker-dispatcher-8694446b46-rqf2l   0/1     Running                  0          8m36s
kafka-broker-dispatcher-8694446b46-vj58d   1/1     Running                  0          2m18s

The metrics endpoint is continuously failing. Node utilization is high, but there is still CPU and memory headroom:

~ » k top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-fianu-prod-fianu-node-pool-ef36cf4e-gjhf   2716m        69%    10440Mi         78%
gke-fianu-prod-fianu-node-pool-ef36cf4e-jlcx   2468m        62%    13899Mi         104%
gke-fianu-prod-fianu-node-pool-ef36cf4e-vtcr   3977m        45%    12308Mi         92%
gke-fianu-prod-fianu-node-pool-ef36cf4e-wsdw   2518m        64%    9113Mi          68%

Seeing lots of:

org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept
{"@timestamp":"2024-04-25T17:47:43.004Z","@version":"1","message":"Failed to produce record path=/dxcm/default","logger_name":"dev.knative.eventing.kafka.broker.receiver.impl.handler.IngressRequestHandlerImpl","thread_name":"vert.x-eventloop-thread-2","level":"WARN","level_value":30000,"stack_trace":"org.apache.kafka.common.errors.TimeoutException: Topic knative-broker-dxcm-default not present in metadata after 60000 ms.\n","path":"/dxcm/default"}
{"@timestamp":"2024-04-25T18:32:34.27Z","@version":"1","message":"Failed to send record topic=knative-broker-d-default {}","logger_name":"dev.knative.eventing.kafka.broker.receiver.impl.handler.IngressRequestHandlerImpl","thread_name":"vert.x-eventloop-thread-0","level":"ERROR","level_value":40000,"stack_trace":"org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for knative-broker-d-default-7:120000 ms has passed since batch creation\n","topic":"knative-broker-dxcm-default"}
{"@timestamp":"2024-04-25T18:32:34.27Z","@version":"1","message":"Failed to produce record path=/d/default","logger_name":"dev.knative.eventing.kafka.broker.receiver.impl.handler.IngressRequestHandlerImpl","thread_name":"vert.x-eventloop-thread-0","level":"WARN","level_value":30000,"stack_trace":"org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for knative-broker-d-default-7:120000 ms has passed since batch creation\n","path":"/d/default"}
{"@timestamp":"2024-04-25T18:32:37.025Z","@version":"1","message":"[Producer clientId=producer-1] Disconnecting from node 2 due to socket connection setup timeout. The timeout value is 31485 ms.","logger_name":"org.apache.kafka.clients.NetworkClient","thread_name":"kafka-producer-network-thread | producer-1","level":"INFO","level_value":20000}

Expected behavior

No error messages in the logs; the configured maximum message size of 20 MB would be respected.
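For reference, in my setup the producer-side limit is raised via the data-plane ConfigMap. This is a sketch of what I would expect to take effect (the ConfigMap name and property key below follow the Knative Kafka broker data-plane convention; adjust to your install if yours differ):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-kafka-broker-data-plane   # data-plane config in the knative-eventing namespace
  namespace: knative-eventing
data:
  config-kafka-broker-producer.properties: |
    # Allow requests up to 20 MB on the producer side.
    # Note: the broker/topic must also allow this via message.max.bytes /
    # max.message.bytes, otherwise RecordTooLargeException is raised server-side.
    max.request.size=20971520
```

Note that `RecordTooLargeException: The request included a message larger than the max message size the server will accept` is a server-side rejection, so raising only the client-side `max.request.size` is not sufficient.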

To Reproduce

Upgraded from 1.2 to 1.3.8

Knative release version

1.3.8

Additional context

Some of these logs may have been present before the upgrade, but I'm providing everything I see to help narrow down where I need to diagnose.

We are connecting to a dedicated Kafka instance running in Confluent Kafka. Could it be that the topics need to be externally managed?
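If topic-level limits are the cause, something like the following should confirm it (a sketch using the standard Kafka CLI; the topic name is taken from the logs above, and the bootstrap address and client properties file are placeholders for your Confluent cluster):

```shell
# Inspect the effective topic config; look for max.message.bytes
# (defaults to the broker's message.max.bytes, ~1 MB unless raised).
kafka-configs.sh --bootstrap-server <confluent-bootstrap>:9092 \
  --command-config client.properties \
  --entity-type topics --entity-name knative-broker-dxcm-default \
  --describe --all

# If it is below 20 MB, raise it at the topic level:
kafka-configs.sh --bootstrap-server <confluent-bootstrap>:9092 \
  --command-config client.properties \
  --entity-type topics --entity-name knative-broker-dxcm-default \
  --alter --add-config max.message.bytes=20971520
```

On managed/dedicated Confluent clusters the broker-wide `message.max.bytes` is typically not user-adjustable, so topic-level overrides (or externally managed topics created with the right config) may indeed be required.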