knative-extensions / eventing-kafka-broker

Alternate Kafka Broker implementation.

Knative Trigger Possible Forced Replay? #3616

Closed: nkreiger closed 6 months ago

nkreiger commented 10 months ago

Describe the bug

I want to be transparent: I am by no means a Kafka expert, and I am trying to better understand what I could have done to cause this error.

What I did was re-apply a large number of Triggers with new names so that I could clear out the old ones. However, the second I applied them, they appear to have started streaming thousands of events all at once, which may have caused the memory error. This resulted in a replay, with events apparently being re-sent from the beginning of the retention period.

{"@timestamp":"2024-01-18T23:56:22.254Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=50 context={topics=[knative-broker-d-staging], consumerGroup='knative-trigger-d-og-snyk-sast-37', reference=uuid: \"a7f4006f-2baf-4f4b-9f64-4867b83ae7e5\"\nnamespace: \"d\"\nname: \"og-snyk-sast-37\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-14","level":"INFO","level_value":20000,"max.poll.records":50,"records":50,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.255Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=79 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-blackduck-0', reference=uuid: \"38e7891c-2e74-4cba-aa34-f4b29ece59bf\"\nnamespace: \"d\"\nname: \"blackduck-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-7","level":"INFO","level_value":20000,"max.poll.records":50,"records":79,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.256Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=66 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-sigstore-tools-0', reference=uuid: \"1a190480-add6-4b3d-9020-a0b4986c565b\"\nnamespace: \"d\"\nname: \"sigstore-tools-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-18","level":"INFO","level_value":20000,"max.poll.records":50,"records":66,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.263Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=52 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-snyk-sast-0', reference=uuid: \"d914661f-da89-43ae-b610-54f6c146b441\"\nnamespace: \"d\"\nname: \"snyk-sast-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-6","level":"INFO","level_value":20000,"max.poll.records":50,"records":52,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.046Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=81 context={topics=[knative-broker-d-staging], consumerGroup='knative-trigger-d-og-session-start-24', reference=uuid: \"6d32d100-3567-42e7-9c4a-0493b2f2aa6d\"\nnamespace: \"d\"\nname: \"og-session-start-24\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-1","level":"INFO","level_value":20000,"max.poll.records":50,"records":81,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.377Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=56 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-helm-0', reference=uuid: \"eb8d7a7a-8955-46e9-8ca5-969f243cddd5\"\nnamespace: \"d\"\nname: \"helm-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-4","level":"INFO","level_value":20000,"max.poll.records":50,"records":56,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.378Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=55 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-helm-0', reference=uuid: \"eb8d7a7a-8955-46e9-8ca5-969f243cddd5\"\nnamespace: \"d\"\nname: \"helm-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-4","level":"INFO","level_value":20000,"max.poll.records":50,"records":55,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.378Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=54 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-helm-0', reference=uuid: \"eb8d7a7a-8955-46e9-8ca5-969f243cddd5\"\nnamespace: \"d\"\nname: \"helm-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-4","level":"INFO","level_value":20000,"max.poll.records":50,"records":54,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.356Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=50 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-sigstore-tools-5', reference=uuid: \"d52f87ca-c202-4386-8173-3df7056cb5dd\"\nnamespace: \"d\"\nname: \"sigstore-tools-5\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-8","level":"INFO","level_value":20000,"max.poll.records":50,"records":50,"context":{}}
{"@timestamp":"2024-01-18T23:56:22.378Z","@version":"1","message":"In flight records exceeds max.poll.records waiting for response from subscriber before polling for new records max.poll.records=50 records=53 context={topics=[knative-broker-d-default], consumerGroup='knative-trigger-d-helm-0', reference=uuid: \"eb8d7a7a-8955-46e9-8ca5-969f243cddd5\"\nnamespace: \"d\"\nname: \"helm-0\"\n}","logger_name":"dev.knative.eventing.kafka.broker.dispatcher.impl.consumer.UnorderedConsumerVerticle","thread_name":"vert.x-worker-thread-4","level":"INFO","level_value":20000,"max.poll.records":50,"records":53,"context":{}}
[65.270s][warning][gc,alloc] vert.x-kafka-consumer-thread-35: Retried waiting for GCLocker too often allocating 256 words
[65.272s][warning][gc,alloc] vert.x-kafka-consumer-thread-115: Retried waiting for GCLocker too often allocating 256 words
[65.273s][warning][gc,alloc] vert.x-kafka-consumer-thread-213: Retried waiting for GCLocker too often allocating 256 words
[65.273s][warning][gc,alloc] kafka-coordinator-heartbeat-thread | knative-trigger-d-whsrc-containerscan: Retried waiting for GCLocker too often allocating 256 words
[65.299s][warning][gc,alloc] vert.x-kafka-consumer-thread-32: Retried waiting for GCLocker too often allocating 4724 words
Aborting due to java.lang.OutOfMemoryError: Java heap space
[65.300s][warning][gc,alloc] vert.x-kafka-consumer-thread-113: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] vert.x-kafka-consumer-thread-152: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] vert.x-kafka-consumer-thread-9: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] kafka-producer-network-thread | producer-39: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] vert.x-eventloop-thread-0: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] kafka-coordinator-heartbeat-thread | knative-trigger-d-fianu-core-dictator-1: Retried waiting for GCLocker too often allocating 256 words
[65.300s][warning][gc,alloc] vert.x-kafka-consumer-thread-17: Retried waiting for GCLocker too often allocating 256 words
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (debug.cpp:362), pid=1, tid=172
#  fatal error: OutOfMemory encountered: Java heap space
#
# JRE version: OpenJDK Runtime Environment (Red_Hat-17.0.9.0.9-1) (17.0.9+9) (build 17.0.9+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM (Red_Hat-17.0.9.0.9-1) (17.0.9+9-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: /core.%e.1.%t
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
[65.311s][warning][gc,alloc] vert.x-kafka-consumer-thread-198: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] vert.x-kafka-consumer-thread-184: Retried waiting for GCLocker too often allocating 748 words
[65.312s][warning][gc,alloc] kafka-coordinator-heartbeat-thread | knative-trigger-d-helm-0: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] kafka-producer-network-thread | producer-184: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] kafka-coordinator-heartbeat-thread | knative-trigger-prod-fianulabs-og-container-asset-attestations-sbom-17: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] vert.x-worker-thread-9: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] vert.x-kafka-consumer-thread-37: Retried waiting for GCLocker too often allocating 256 words
[65.313s][warning][gc,alloc] kafka-producer-network-thread | producer-205: Retried waiting for GCLocker too often allocating 256 words
[65.312s][warning][gc,alloc] vert.x-worker-thread-3: Retried waiting for GCLocker too often allocating 256 words
[65.313s][warning][gc,alloc] vert.x-worker-thread-2: Retried waiting for GCLocker too often allocating 256 words
[65.314s][warning][gc,alloc] vert.x-kafka-consumer-thread-59: Retried waiting for GCLocker too often allocating 75749 words
[65.315s][warning][gc,alloc] vert.x-kafka-consumer-thread-225: Retried waiting for GCLocker too often allocating 256 words
[65.315s][warning][gc,alloc] micrometer-kafka-metrics: Retried waiting for GCLocker too often allocating 256 words
[65.316s][warning][gc,alloc] vert.x-kafka-consumer-thread-197: Retried waiting for GCLocker too often allocating 6970 words
[65.316s][warning][gc,alloc] micrometer-kafka-metrics: Retried waiting for GCLocker too often allocating 256 words
[65.316s][warning][gc,alloc] kafka-producer-network-thread | producer-131: Retried waiting for GCLocker too often allocating 256 words
[65.316s][warning][gc,alloc] kafka-producer-network-thread | producer-10: Retried waiting for GCLocker too often allocating 256 words
[65.317s][warning][gc,alloc] kafka-coordinator-heartbeat-thread | knative-trigger-prod-fianulabs-og-session-end-20: Retried waiting for GCLocker too often allocating 256 words
[65.317s][warning][gc,alloc] vert.x-kafka-consumer-thread-165: Retried waiting for GCLocker too often allocating 22861 words

[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007b84a5c82898]
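
If I'm reading the repeated "In flight records exceeds max.poll.records" lines correctly, they look like a backpressure guard: the dispatcher stops fetching while too many records are still waiting on a response from the subscriber. Below is a minimal, generic sketch of that pattern using the plain Kafka consumer API, just to illustrate what I think the log means; it is not the actual UnorderedConsumerVerticle code, and the bootstrap address, group id, and dispatch() helper are made up.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Generic backpressure sketch (not Knative code): stop fetching once the number of
// "in flight" records (delivered but not yet acknowledged) reaches max.poll.records.
public class BackpressureSketch {
    private static final int MAX_IN_FLIGHT = 50; // mirrors max.poll.records=50 in the logs
    private static final AtomicInteger inFlight = new AtomicInteger();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "knative-trigger-d-example-0");    // hypothetical group
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, MAX_IN_FLIGHT);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("knative-broker-d-default"));                 // topic name from the logs
            while (true) {
                if (inFlight.get() >= MAX_IN_FLIGHT) {
                    // Too many unacknowledged deliveries: stop fetching, but keep
                    // polling so the consumer is not kicked out of its group.
                    consumer.pause(consumer.assignment());
                } else {
                    consumer.resume(consumer.paused());
                }
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    inFlight.incrementAndGet();
                    dispatch(record);
                }
            }
        }
    }

    // Placeholder for delivering the record to the Trigger's subscriber; in a real
    // dispatcher the decrement would happen when the HTTP response comes back.
    private static void dispatch(ConsumerRecord<byte[], byte[]> record) {
        inFlight.decrementAndGet();
    }
}
```

The point of pause()/resume() in this pattern is that poll() keeps getting called, so the consumer stays in its group without fetching more data than the subscriber can absorb.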

Expected behavior

Applying new Triggers should result in consumption starting at the latest offset.

To Reproduce

Working on it.

Knative release version

v1.12

Additional context

Hoping to better understand whether this is a bug or just my own unfamiliarity with the underlying infrastructure and how it works.

pierDipi commented 10 months ago

Thanks for reporting, @nkreiger. Apparently there is an open issue on the OpenJDK project, https://bugs.openjdk.org/browse/JDK-8192647. Can you share a little bit of information regarding:

nkreiger commented 10 months ago

Hi @pierDipi, I have 220 Triggers and 6 Brokers right now.

Resource requests and limits aren't defined on the kafka-broker-dispatcher right now, so I assumed it would scale as needed, or at least until, I guess, the node ran out of memory?

This is the first Knative version we've deployed (v1.12); it's a new setup.

My retention period is 1 week, and I accidentally applied 60 duplicate Triggers. That possibly caused the crash and the resulting restart, and it looks like the dispatcher then sent all events from the last week.

Do you know how to configure the dispatcher to read from the latest offset instead of firing all retained events?
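
To make the question concrete: in plain Kafka terms, where a consumer group with no committed offsets starts reading is controlled by auto.offset.reset ("latest" vs. "earliest"). Here is a minimal sketch with the raw consumer API of what I mean by "read from the latest offset"; the bootstrap address and group id are made up, this is not the Knative dispatcher's code, and I'm not sure how (or whether) the broker exposes this setting for Triggers.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of auto.offset.reset semantics for a consumer group with no committed offsets.
public class OffsetResetSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "knative-trigger-d-example-0");    // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // With no committed offsets for the group:
        //   "latest"   -> only records produced after the group joins are delivered
        //   "earliest" -> everything still inside the retention window is replayed
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("knative-broker-d-default"));                 // topic name from the logs
            consumer.poll(Duration.ofSeconds(5))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}
```

If a recreated Trigger's consumer group ends up with "earliest" semantics, or with offsets committed at the start of the retention window, I'd expect exactly the week-long replay I'm seeing.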

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.