odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

Kafka TimeoutException #5066

Closed cmgrote closed 3 years ago

cmgrote commented 3 years ago

While running the new Performance Test Suite against the Graph repository, I'm hitting a point at which the test seems to stall entirely. After about an hour of the apparent stall, messages like the following are eventually printed out:

2021-04-19 14:08:22.248  INFO 1 --- [| adminclient-1] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-1] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618841235545, tries=1, nextAllowedTryMs=1618841237639) timed out at 1618841237539 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

2021-04-19 14:08:22.250  INFO 1 --- [| adminclient-1] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-1] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618841267539, tries=1, nextAllowedTryMs=1618841302350) timed out at 1618841302250 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

2021-04-19 14:28:54.140  INFO 1 --- [| adminclient-2] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-2] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618842531445, tries=1, nextAllowedTryMs=1618842534239) timed out at 1618842534139 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

It's not clear from the messages what would be causing a Kafka delay, nor why this only seems to occur with the Graph repository: the same infrastructure has been used to run the performance test suite for other repositories dozens of times without hitting this issue.

Any ideas? (This is under the 2.8 release itself, not a dev / snapshot version.)

cmgrote commented 3 years ago

Some minutes later, the logged lines above are then followed by lines like the following:

2021-04-19 14:40:20.258  INFO 1 --- [| adminclient-2] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-2] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618843203909, tries=1, nextAllowedTryMs=1618843220355) timed out at 1618843220255 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled fetchMetadata request with correlation id 219 due to node 1001 being disconnected

2021-04-19 14:46:17.973  INFO 1 --- [| adminclient-2] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-2] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618843574392, tries=1, nextAllowedTryMs=1618843578073) timed out at 1618843577973 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

2021-04-19 14:53:58.617  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 14:54:18.185  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)

2021-04-19 14:54:52.693 ERROR 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Offset commit failed on partition egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0 at offset 3960: The coordinator is not aware of this member.
2021-04-19 14:55:08.119  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] OffsetCommit failed with Generation{generationId=1, memberId='consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1-683bd666-ac98-4068-8fac-e148ee22d9ce', protocol='range'}: The coordinator is not aware of this member.
2021-04-19 14:55:08.119  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.

2021-04-19 14:55:47.004  WARN 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Asynchronous auto-commit of offsets {egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=0, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2021-04-19 14:55:47.005  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)

2021-04-19 14:56:31.165  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Failing OffsetCommit request since the consumer is not part of an active group
2021-04-19 14:56:41.388  WARN 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Synchronous auto-commit of offsets {egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=0, metadata=''}} failed: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
2021-04-19 14:56:41.388  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
2021-04-19 14:56:41.388  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Lost previously assigned partitions egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0
2021-04-19 14:56:51.292  INFO 1 --- [.coco.OMRSTopic] o.a.e.t.k.KafkaOpenMetadataEventConsumer : Lost partitions in rebalance. Committing current offsets:{egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=null, metadata=''}}
2021-04-19 14:57:00.615  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 14:57:54.812  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 14:58:23.364  INFO 1 --- [| adminclient-1] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-1] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618844270045, tries=1, nextAllowedTryMs=1618844284724) timed out at 1618844284624 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

2021-04-19 14:58:28.118  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Failing OffsetCommit request since the consumer is not part of an active group
2021-04-19 14:58:13.775  INFO 1 --- [| adminclient-2] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-2] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618844284936, tries=1, nextAllowedTryMs=1618844293875) timed out at 1618844293775 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

2021-04-19 14:58:43.005  INFO 1 --- [| adminclient-1] o.a.k.c.a.i.AdminMetadataManager         : [AdminClient clientId=adminclient-1] Metadata update failed

org.apache.kafka.common.errors.TimeoutException: Call(callName=fetchMetadata, deadlineMs=1618844314624, tries=1, nextAllowedTryMs=1618844318015) timed out at 1618844317915 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata

Mon Apr 19 14:58:43 GMT 2021 pts Information OCF-KAFKA-TOPIC-CONNECTOR-0018 The Egeria client was rebalanced by Kafka and failed to commit already consumed events
2021-04-19 14:58:43.006  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] (Re-)joining group
2021-04-19 14:58:43.006 ERROR 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Offset commit failed on partition egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0 at offset 3960: The coordinator is not aware of this member.
2021-04-19 14:58:43.006  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] OffsetCommit failed with Generation{generationId=1, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-0ae4fd8d-a21f-4b9e-9ba4-03d09a8ced63', protocol='range'}: The coordinator is not aware of this member.
2021-04-19 14:58:43.006  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 14:58:52.532  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Attempt to heartbeat with stale Generation{generationId=1, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-0ae4fd8d-a21f-4b9e-9ba4-03d09a8ced63', protocol='range'} and group instance id Optional.empty failed due to UNKNOWN_MEMBER_ID, ignoring the error
2021-04-19 14:59:02.427  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.
2021-04-19 14:59:02.428  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Rebalance failed.

org.apache.kafka.common.errors.DisconnectException: null

2021-04-19 14:59:17.488  WARN 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Asynchronous auto-commit of offsets {egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=0, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
2021-04-19 14:59:27.511  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 14:59:37.155  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 14:59:46.766  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Failing OffsetCommit request since the consumer is not part of an active group
2021-04-19 14:59:51.844  WARN 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Synchronous auto-commit of offsets {egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=0, metadata=''}} failed: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
2021-04-19 14:59:51.844  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
2021-04-19 14:59:51.844  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Lost previously assigned partitions egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0
2021-04-19 14:59:51.845  INFO 1 --- [.coco.OMRSTopic] o.a.e.t.k.KafkaOpenMetadataEventConsumer : Lost partitions in rebalance. Committing current offsets:{egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=null, metadata=''}}
2021-04-19 14:59:51.845  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Failing OffsetCommit request since the consumer is not part of an active group
Mon Apr 19 14:59:51 GMT 2021 myserver Information OCF-KAFKA-TOPIC-CONNECTOR-0018 The Egeria client was rebalanced by Kafka and failed to commit already consumed events
2021-04-19 14:59:51.845  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] (Re-)joining group
2021-04-19 15:00:01.396  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] (Re-)joining group
2021-04-19 15:00:21.644  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] (Re-)joining group
2021-04-19 15:00:26.639  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Successfully joined group with generation Generation{generationId=3, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-69d2dbd1-01a9-4c48-b5d5-86c540ee0a27', protocol='range'}
2021-04-19 15:00:26.639  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Finished assignment for group at generation 3: {consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-69d2dbd1-01a9-4c48-b5d5-86c540ee0a27=Assignment(partitions=[egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0])}
2021-04-19 15:00:36.330  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] (Re-)joining group
2021-04-19 15:00:50.753  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Successfully synced group in generation Generation{generationId=3, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-69d2dbd1-01a9-4c48-b5d5-86c540ee0a27', protocol='range'}
2021-04-19 15:00:50.753  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Successfully joined group with generation Generation{generationId=3, memberId='consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1-98b56ee0-0f33-4738-be15-50a6e0cbea52', protocol='range'}
2021-04-19 15:01:00.804  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Notifying assignor about the new Assignment(partitions=[egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0])
2021-04-19 15:01:00.804  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Adding newly assigned partitions: egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0
2021-04-19 15:01:05.892  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Finished assignment for group at generation 3: {consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1-98b56ee0-0f33-4738-be15-50a6e0cbea52=Assignment(partitions=[egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0])}
2021-04-19 15:01:21.038  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: null.isDisconnected: true. Rediscovery will be attempted.
2021-04-19 15:01:36.588  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Successfully synced group in generation Generation{generationId=3, memberId='consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1-98b56ee0-0f33-4738-be15-50a6e0cbea52', protocol='range'}
2021-04-19 15:01:36.588  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 15:01:36.591  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 15:01:41.669  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 15:01:46.771  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Notifying assignor about the new Assignment(partitions=[egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0])
2021-04-19 15:01:46.771  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Adding newly assigned partitions: egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0
2021-04-19 15:02:02.239  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Attempt to heartbeat with Generation{generationId=3, memberId='consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1-98b56ee0-0f33-4738-be15-50a6e0cbea52', protocol='range'} and group instance id Optional.empty failed due to UNKNOWN_MEMBER_ID, resetting generation
2021-04-19 15:02:12.313  INFO 1 --- [0a-d41a3de0e7d1] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 15:02:17.324  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)
2021-04-19 15:02:22.373  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Attempt to heartbeat with Generation{generationId=3, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-69d2dbd1-01a9-4c48-b5d5-86c540ee0a27', protocol='range'} and group instance id Optional.empty failed due to UNKNOWN_MEMBER_ID, resetting generation
2021-04-19 15:02:22.373  INFO 1 --- [6f-c356c1a31c37] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.
2021-04-19 15:02:27.532  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Attempt to heartbeat with stale Generation{generationId=3, memberId='consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3-69d2dbd1-01a9-4c48-b5d5-86c540ee0a27', protocol='range'} and group instance id Optional.empty failed due to UNKNOWN_MEMBER_ID, ignoring the error
2021-04-19 15:02:27.532  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group
2021-04-19 15:02:27.532  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Lost previously assigned partitions egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0
2021-04-19 15:02:27.532  INFO 1 --- [.coco.OMRSTopic] o.a.e.t.k.KafkaOpenMetadataEventConsumer : Lost partitions in rebalance. Committing current offsets:{egeria.openmetadata.repositoryservices.cohort.coco.OMRSTopic-0=OffsetAndMetadata{offset=3960, leaderEpoch=null, metadata=''}}
2021-04-19 15:02:27.533  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.ConsumerCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] Failing OffsetCommit request since the consumer is not part of an active group
Mon Apr 19 15:02:32 GMT 2021 pts Information OCF-KAFKA-TOPIC-CONNECTOR-0018 The Egeria client was rebalanced by Kafka and failed to commit already consumed events
2021-04-19 15:02:32.640  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-e17e5be0-2503-4c3b-890a-d41a3de0e7d1-1, groupId=e17e5be0-2503-4c3b-890a-d41a3de0e7d1] (Re-)joining group
2021-04-19 15:02:37.652  INFO 1 --- [.coco.OMRSTopic] o.a.k.c.c.internals.AbstractCoordinator  : [Consumer clientId=consumer-8dac1751-9c4e-4dc1-bc6f-c356c1a31c37-3, groupId=8dac1751-9c4e-4dc1-bc6f-c356c1a31c37] Discovered group coordinator g5-2-kafka-0.g5-2-kafka-headless.default.svc.cluster.local:9092 (id: 2147482646 rack: null)

... which then just appears to go on endlessly...

Could this be a result of the processing of an event exceeding Kafka's poll interval / timeouts before the consumed offset can be committed back?
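
For reference, the rebalance warning above points at the standard Kafka consumer settings for exactly that scenario. Below is a minimal sketch of how those knobs can be set on a plain Kafka consumer; the property names are standard Kafka client settings, but the class name and values are purely illustrative, not Egeria's actual configuration.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SlowProcessingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");            // illustrative group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // If handling one batch of events can take longer than the 5-minute Kafka default,
        // raise max.poll.interval.ms so the consumer is not ejected from the group...
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");  // 10 minutes, illustrative
        // ...or hand back fewer records per poll() so each cycle finishes sooner.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50");          // illustrative

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() / poll() loop would go here
        }
    }
}

Either change only buys more time per poll cycle; if the underlying event processing is genuinely stalled, the consumer will still fall behind eventually.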

cmgrote commented 3 years ago

And eventually this devolves into an endless stream of stack traces like these:

2021-04-19 17:50:21.702  WARN 1 --- [pool-6-thread-1] o.j.diskstorage.log.kcvs.KCVSLog         : Could not read messages for timestamp [2021-04-19T17:50:11.381947Z] (this read will be retried)

org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:56) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:158) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.log.kcvs.KCVSLog$MessagePuller.run(KCVSLog.java:725) ~[janusgraph-core-0.5.3.jar!/:na]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[na:na]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: org.janusgraph.diskstorage.PermanentBackendException: Could not start BerkeleyJE transaction
    at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:163) ~[janusgraph-berkeleyje-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:47) ~[janusgraph-berkeleyje-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.keycolumnvalue.keyvalue.OrderedKeyValueStoreManagerAdapter.beginTransaction(OrderedKeyValueStoreManagerAdapter.java:68) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.log.kcvs.KCVSLog.openTx(KCVSLog.java:319) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:145) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:161) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:68) ~[janusgraph-core-0.5.3.jar!/:na]
    at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:54) ~[janusgraph-core-0.5.3.jar!/:na]
    ... 8 common frames omitted
Caused by: com.sleepycat.je.EnvironmentFailureException: (JE 18.3.12) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.12) ./data/servers/myserver/repository/graph/berkeley Latch timeout. BIN185981 currentThread: Thread[Checkpointer,5,main] currentTime: 1618854602073 exclusiveOwner: Thread[https-jsse-nio-9443-exec-1,5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
    at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:230) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.Environment.checkOpen(Environment.java:2697) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.Environment.beginTransactionInternal(Environment.java:1409) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.Environment.beginTransaction(Environment.java:1383) ~[je-18.3.12.jar!/:18.3.12]
    at org.janusgraph.diskstorage.berkeleyje.BerkeleyJEStoreManager.beginTransaction(BerkeleyJEStoreManager.java:146) ~[janusgraph-berkeleyje-0.5.3.jar!/:na]
    ... 15 common frames omitted
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.12) ./data/servers/myserver/repository/graph/berkeley Latch timeout. BIN185981 currentThread: Thread[Checkpointer,5,main] currentTime: 1618854602073 exclusiveOwner: Thread[https-jsse-nio-9443-exec-1,5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
    at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:459) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.latch.LatchSupport.handleTimeout(LatchSupport.java:211) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.latch.LatchImpl.acquireExclusive(LatchImpl.java:63) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.latch.LatchImpl.acquireShared(LatchImpl.java:107) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.tree.IN.latchShared(IN.java:563) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.recovery.DirtyINMap.selectDirtyBINChildrenForCheckpoint(DirtyINMap.java:391) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.recovery.DirtyINMap.selectDirtyINsForCheckpoint(DirtyINMap.java:261) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.recovery.Checkpointer.doCheckpoint(Checkpointer.java:826) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.recovery.Checkpointer.onWakeup(Checkpointer.java:572) ~[je-18.3.12.jar!/:18.3.12]
    at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:154) ~[je-18.3.12.jar!/:18.3.12]
    ... 1 common frames omitted

Apr 19, 2021 5:50:21 PM com.sleepycat.je.utilint.DaemonThread run
SEVERE: <DaemonThread name="Checkpointer"/> caught exception, com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.12) ./data/servers/myserver/repository/graph/berkeley Latch timeout. BIN185981 currentThread: Thread[Checkpointer,5,main] currentTime: 1618854602073 exclusiveOwner: Thread[https-jsse-nio-9443-exec-1,5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed. Exiting
com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.12) ./data/servers/myserver/repository/graph/berkeley Latch timeout. BIN185981 currentThread: Thread[Checkpointer,5,main] currentTime: 1618854602073 exclusiveOwner: Thread[https-jsse-nio-9443-exec-1,5,main] UNEXPECTED_STATE_FATAL: Unexpected internal state, unable to continue. Environment is invalid and must be closed.
    at com.sleepycat.je.EnvironmentFailureException.unexpectedState(EnvironmentFailureException.java:459)
    at com.sleepycat.je.latch.LatchSupport.handleTimeout(LatchSupport.java:211)
    at com.sleepycat.je.latch.LatchImpl.acquireExclusive(LatchImpl.java:63)
    at com.sleepycat.je.latch.LatchImpl.acquireShared(LatchImpl.java:107)
    at com.sleepycat.je.tree.IN.latchShared(IN.java:563)
    at com.sleepycat.je.recovery.DirtyINMap.selectDirtyBINChildrenForCheckpoint(DirtyINMap.java:391)
    at com.sleepycat.je.recovery.DirtyINMap.selectDirtyINsForCheckpoint(DirtyINMap.java:261)
    at com.sleepycat.je.recovery.Checkpointer.doCheckpoint(Checkpointer.java:826)
    at com.sleepycat.je.recovery.Checkpointer.onWakeup(Checkpointer.java:572)
    at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:154)
    at java.base/java.lang.Thread.run(Thread.java:834)
planetf1 commented 3 years ago

Looks like some kind of logic error (not handling failure/recovery). I've not seen issues on the CTS, but I've only run the default profiles for the 2.8 release. I presume it may have been brought on under load (timeouts, high CPU/disk). For a more real-world deployment we wouldn't expect to be using BerkeleyDB but rather another JanusGraph backend (not to say it shouldn't work in this example).

wbittles commented 3 years ago

That's a very strange value for a timeout... it looks like some kind of default value. Egeria sets the following by default, but only for the consumer; there don't seem to be any producer timeouts set, which again is interesting.

auto.commit.interval.ms 1000
session.timeout.ms 30000
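
To make that concrete, here is a minimal sketch of those two consumer defaults expressed as standard Kafka client properties, alongside the producer-side timeout settings that could be set explicitly. The producer values are hypothetical, and none of this is Egeria's actual connector code.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class TimeoutConfigSketch {
    public static void main(String[] args) {
        // The consumer-side defaults quoted above, as standard Kafka client properties:
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        consumerProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");

        // Producer-side timeouts that could be set explicitly (standard Kafka producer
        // settings, hypothetical values -- not something Egeria configures today):
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");    // cap on a single broker request
        producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");  // overall bound, including retries

        System.out.println("consumer: " + consumerProps + "\nproducer: " + producerProps);
    }
}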

I'd advise trying to gather some system counters for CPU, disk (reads and writes), memory, and network. I'd also try to take a core dump of the JVM(s); it could be some kind of locking issue.

If we look at the example exception below, we can see that this exception appears as a "caused by" clause; it feels like the Egeria logs could be missing an exception from the call stack.

[2021-03-08 15:42:19,595] ERROR Stopping due to error (org.apache.kafka.connect.mirror.MirrorMaker:304)
org.apache.kafka.connect.errors.ConnectException: Failed to connect to and describe Kafka cluster. Check worker's broker connection and security properties.
    at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:70)
    at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:51)
    at org.apache.kafka.connect.mirror.MirrorMaker.addHerder(MirrorMaker.java:235)
    at org.apache.kafka.connect.mirror.MirrorMaker.lambda$new$1(MirrorMaker.java:136)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at org.apache.kafka.connect.mirror.MirrorMaker.<init>(MirrorMaker.java:136)
    at org.apache.kafka.connect.mirror.MirrorMaker.<init>(MirrorMaker.java:148)
    at org.apache.kafka.connect.mirror.MirrorMaker.main(MirrorMaker.java:291)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=listNodes, deadlineMs=1615236139583, tries=1, nextAllowedTryMs=1615236139684) timed out at 1615236139584 after 1 attempt(s)
    at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
    at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
    at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
    at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
    at org.apache.kafka.connect.util.ConnectUtils.lookupKafkaClusterId(ConnectUtils.java:64)
    ... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=listNodes, deadlineMs=1615236139583, tries=1, nextAllowedTryMs=1615236139684) timed out at 1615236139584 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listNodes

planetf1 commented 3 years ago

@wbittles you showed a stack from MirrorMaker. Is that an example, or did you see it in the original issue above? I was intrigued, as I believe MirrorMaker is a standalone replication tool, so I wasn't aware that we were using it in Egeria / for CTS testing?

@cmgrote when you say the thread appeared to be stalled... I wonder if it would be useful to attach to it with the debugger to see what it is up to? Or look at switching on additional tracing / capturing more trace info from the JVM.

wbittles commented 3 years ago

@planetf1 It's just an example of another Kafka client's handling of this Kafka exception, and of how it's typically a low-level exception, hence the reference to the "caused by" clause. It's strange to see this exception in the Egeria log without the stack trace of the function that received the exception. In addition, it also has a timeout value in the same ballpark as the one Chris reported.

planetf1 commented 3 years ago

@cmgrote what is the topology at work here - are you running in containers, locally? Could anything have exited due to, for example, memory restrictions in the host (or the container), or was it all local?

cmgrote commented 3 years ago

The topology is a 3-node k8s cluster on virtualised kit (so some potential background contention at play). I co-locate all component pods (connector and its repo, Kafka, and ZooKeeper) onto the same node of the cluster, and nothing else runs there. The node has 8 cores and 32GB of memory; the main pod is limited to 4 cores and 16GB of memory (requesting 2 cores and 8GB), while I believe Kafka and ZooKeeper request 1-2GB each.

I use exactly the same cluster (and same per-pod requests and limits) to run tests on other repos and it is only ever the graph repo that seems to produce these kinds of issues....

The latest errors I saw indicated that the max.request.size parameter on the Kafka producer was too small, so I've increased that in the config used to run the connector, and (for now) that seems to have settled things back to a working state.
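
For anyone hitting the same symptom, the change amounts to raising the standard max.request.size producer property. A sketch of what that looks like on a plain Kafka producer follows; the 5MB figure mirrors the value discussed further down, and the rest is illustrative rather than Egeria's actual configuration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class LargeMessageProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Raise the client-side cap on the size of a single (batched) produce request;
        // the Kafka default is 1MB.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, Integer.toString(5 * 1024 * 1024));  // 5MB

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) calls would go here
        }
    }
}

Note that in general Kafka deployments the broker/topic limits (message.max.bytes / max.message.bytes) can also cap large messages, so the producer setting alone may not be the whole story if the symptom returns.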

Still strange that it is only this repo that seems to have these Kafka-related impacts (?)

planetf1 commented 3 years ago

Should be a decent config. Good to see you're finding a working approach; any general suggestions may be worth adding to the docs. That being said, it's a massive area in its own right, and I presume that in an enterprise deployment Kafka experts will be on hand.

What did you go from/to on the max request size? I think the default is 2MB, which is a fair size. Does this indicate we need to do something in our message handling to split some messages, and how big could they get (since there's always going to be a limit)? Do you know what the payload was?

The other aspect to this is to understand the error processing: can we help the admin understand why things are failing, and are there any problems in the error paths that we should take action on?

I don't underestimate the fact that this could be quite tricky, though.

cmgrote commented 3 years ago

The default is 1MB, and the logged errors suggested that this was sometimes barely exceeded, and other times hit approximately 2MB. I bumped it to 5MB for the config.

The documentation on the parameter from Kafka's side suggests that it is not just the maximum size of a single record, but that it is used for some sort of batching as well. This raises the possibility in my mind that it may not be as simple as a single payload breaching the maximum size; there could have been some queuing up that overflowed the batch (though I'd think in that case this would be a bug in Kafka itself, so it's perhaps less likely).

No idea what the payload(s) that caused the overflow were, since they wouldn't have been written to Kafka and weren't written out anywhere in the log. (And no idea (or time) how to attach a debugger to a process that runs inside a container, itself running inside a remote k8s cluster, over private networks, etc., to see what is happening in the JVM process itself.)
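
As a general Kafka point (not specific to Egeria's connector code), a record that breaches max.request.size is rejected client-side and surfaces as a RecordTooLargeException on the send callback, so the payload never reaches the broker. A hypothetical sketch of how at least the size of the offending event could be captured at that point (the helper name and logging are invented for illustration):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;

public class OversizeEventLoggingSketch {

    // Hypothetical helper: send an event and log enough detail to diagnose oversize failures.
    static void sendWithSizeLogging(KafkaProducer<String, String> producer, String topic, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, payload);
        producer.send(record, (metadata, exception) -> {
            if (exception instanceof RecordTooLargeException) {
                // The payload itself never reaches Kafka, so record at least its size locally.
                System.err.printf("Event of %d bytes rejected as too large for topic %s: %s%n",
                        payload.getBytes(StandardCharsets.UTF_8).length, topic, exception.getMessage());
            } else if (exception != null) {
                System.err.println("Send failed: " + exception.getMessage());
            }
        });
    }
}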

wbittles commented 3 years ago

Typically the initial steps would require data from something like a combination of Prometheus and the Elastic stack. Is there any chance we could use this situation to improve the serviceability of Egeria in real production environments?

It does sound like you have some kind of performance-tuning issue, but we would be far more certain if we had some indication of where the problem or resource contention actually manifests itself.

For what it's worth, it isn't uncommon that a faster performing persistence layer will expose an upstream bottleneck.

planetf1 commented 3 years ago

@wbittles agreed; we could certainly look more at the Prometheus setup and metrics in our k8s deployment. It should be easy to add the basics (and we can expose some data via Spring); no doubt Kafka has plenty too, though we may at some point want to extend Egeria itself to be able to report more metrics in terms of Egeria's view of the world.

This should be close to out-of-the-box on something like an OpenShift Helm deployment of Egeria, where the log/search side for the data is already provided by another cloud service.

planetf1 commented 3 years ago

@cmgrote I have successfully attached to a container running in a k8s deployment from my local machine. It's pretty much business as usual, i.e. adding the appropriate JVM config, exposing the debug port, and ensuring a service is configured so this can be accessed. Any VPN setup is then similarly needed if running on a private network. In fact, the core Egeria piece is already done in the egeria-odpi-lab chart as config options.

HOWEVER, thinking about your problem description, I don't see that it's viable here. Debugging slows down the JVM MASSIVELY, plus the volume of data will be huge. I had this issue with the CTS... I got there eventually, but it took SO LONG. As such, I think @wbittles' suggestion of doing more via logging (and using what we already have) makes more sense.

If we can't tackle this issue now, I suggest we leave it open. Maybe we tag it as a performance issue insofar as it at least documents what we know, and anyone concerned about scalability can pitch in to help?

planetf1 commented 3 years ago

@cmgrote Looks like batch.size and linger.ms control batching. There's some info here on the relationship to message sizes https://stackoverflow.com/questions/54193852/is-this-possible-producer-batch-size-max-request-size-broker-max-message-by
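
A rough sketch of how those producer settings usually relate, with hypothetical values only, following the ordering discussed in that link (batch.size <= max.request.size <= the broker's maximum message size):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchingConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // How long the producer waits to fill a batch before sending it anyway.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");
        // Target size of a single batch of records for one partition.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(256 * 1024));            // 256KB, illustrative
        // Client-side cap on the whole produce request (and effectively on any one record).
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, Integer.toString(5 * 1024 * 1024));  // 5MB, illustrative
        // The broker/topic must also accept it: message.max.bytes / max.message.bytes are set server-side.
        System.out.println(props);
    }
}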

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.