odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

Failure to initialize cohort correctly on initial startup (kafka/topics issue?) #2119

Closed planetf1 closed 4 years ago

planetf1 commented 4 years ago

Changes required to support the CTS notebook in the lab containerized environments (docker-compose & k8s):

a) Data limit to be increased

b) 'pandas' module needs adding

c) ctsServerURL needs defining

planetf1 commented 4 years ago

The CTS notebook requires a higher than default data limit.

This is referred to in the text as jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

For the docker-compose & k8s environments this should be done in our templates so the user needs to do nothing

For the local environment it would be useful if the notebook could check this parameter, though that may not be feasible

Fix in master only (from #2118)

planetf1 commented 4 years ago

Step 14 of the cts notebook fails with

ModuleNotFoundError: No module named 'pandas'

There is a note that this needs installing - but I'd guess many people using our notebooks won't know so much about notebooks/modules so a little more explanation may help

More importantly we need to ensure we have pandas available in the docker compose & k8s environment

target fix for master (1.3) only.
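As a rough sketch of the "little more explanation" idea, a notebook cell could check for its dependencies up front and print an install hint instead of failing at step 14. The helper name `missing_modules` and the `REQUIRED_MODULES` list are hypothetical and not part of the existing notebook; the check only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list of modules the CTS notebook depends on.
REQUIRED_MODULES = ["pandas"]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing Python modules: " + ", ".join(missing))
    print("Install them with, for example: pip install " + " ".join(missing))
else:
    print("All required modules are available.")
```

This only improves the message for local users; the compose and k8s images would still need pandas installed in the notebook container.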

planetf1 commented 4 years ago

Step 2 of the cts notebook fails due to ctsPlatformURL = os.environ.get('ctsPlatformURL','http://localhost:8080')

since no 'ctsPlatformURL' exists.

An easy workaround is to use ctsPlatformURL = os.environ.get('corePlatformURL','http://localhost:8080')

in the notebook - since it is the same platform as normally seen on localhost:8080
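A minimal sketch of that workaround, written as a fallback chain so the notebook still prefers 'ctsPlatformURL' when it is defined. The function name `resolve_platform_url` is hypothetical; the variable names and the default URL are the ones quoted above.

```python
import os


def resolve_platform_url(env,
                         names=("ctsPlatformURL", "corePlatformURL"),
                         default="http://localhost:8080"):
    """Return the first non-empty value among `names` in `env`, else `default`."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return default


# In the notebook this would replace the single os.environ.get() call.
ctsPlatformURL = resolve_platform_url(os.environ)
print(ctsPlatformURL)
```

Passing the environment in as a mapping keeps the lookup easy to exercise with a plain dict, without touching the real process environment.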

Worth considering is whether we wish to keep the CTS testing away from the other servers & platforms being used for other tutorials. I'd be inclined to do this @grahamwallis

planetf1 commented 4 years ago

(Note: repurposed this issue to aggregate all the 'required' changes to get CTS at least running in k8s/compose)

planetf1 commented 4 years ago

Having experimented with this a little more

Furthermore for Coco Pharma, if we look at the topology presented in the diagrams, 'cocoCohort' is the production cohort used by the datalake and other systems

If the org is looking at introducing a new connector, and testing, it will be the dev cohort that they are using.

For this reason I have opted to use the dev platform (for setting ctsPlatform) and further, propose to change the default cohort name used to devCohort instead.

I feel this is most consistent with Coco's policies :-)

It also stops the test events whirling around the environment from contaminating their production datalake - or in our case other notebooks running against the datalake.

planetf1 commented 4 years ago

Looks like in the compose environment I'm hitting some kafka related issues:

kafka_1      | [2020-01-06 18:21:14,379] WARN [SocketServer brokerId=1001] Unexpected error from /172.22.0.1; closing connection (org.apache.kafka.common.network.Selector)
kafka_1      | org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1212498244 larger than 104857600)
kafka_1      |  at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:104)
kafka_1      |  at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
kafka_1      |  at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
kafka_1      |  at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:651)
kafka_1      |  at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:572)
kafka_1      |  at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
kafka_1      |  at kafka.network.Processor.poll(SocketServer.scala:890)
kafka_1      |  at kafka.network.Processor.run(SocketServer.scala:789)
kafka_1      |  at java.lang.Thread.run(Thread.java:748)
planetf1 commented 4 years ago

that seems like some 1.2GB messages

planetf1 commented 4 years ago

No real differences between k8s & compose for kafka config (they use the same docker image). Could be timing/memory/permission related in terms of container environment....

planetf1 commented 4 years ago

Am going to complete the initial PR as this gets the basics in place and fixes k8s.

Leaving issue open to address the compose environment

planetf1 commented 4 years ago

This is still occurring

Though the CTS appears to launch, the CTS server never sees the SUT and reports:

Mon Jan 20 12:23:25 GMT 2020 CTS_Server Information CONFORMANCE-SUITE-0008 The Open Metadata Repository Conformance Workbench repository-workbench is waiting for server SUT_Server to join the cohort

Looking at the cohort, /cohort-descriptions shows that both CTS and SUT believe they are part of devCohort. Yet looking at /remote-members neither sees the other.

In the same deployment, once started, the existing servers do appear to see each other, and indeed once started both the CTS and SUT servers can see these other cohort members too

So fundamentally cohorts are working (and hence kafka) but something is wrong with the registration of the cts/sut server in this environment
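The manual check of /remote-members could be scripted along these lines. This is only a sketch: `fetch_member_names` is a hypothetical callable that the caller would implement against the /remote-members call, and the canned responses below stand in for it so the example runs without a platform.

```python
import time


def wait_for_member(fetch_member_names, expected, attempts=10, pause=1.0,
                    sleep=time.sleep):
    """Poll until `expected` appears in the remote member list, or give up.

    Returns True if the member was seen within `attempts` polls, else False.
    """
    for attempt in range(attempts):
        if expected in fetch_member_names():
            return True
        if attempt < attempts - 1:
            sleep(pause)
    return False


# Canned responses standing in for three successive /remote-members calls.
responses = iter([[], [], ["SUT_Server"]])
seen = wait_for_member(lambda: next(responses), "SUT_Server",
                       attempts=5, pause=0.0)
print("SUT_Server visible to CTS_Server:", seen)
```

A poll like this would only report the problem sooner; it would not make a lost registration reappear.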

planetf1 commented 4 years ago

From the logs, it would appear that neither SUT_Server nor CTS_Server is sending the initial 'new registration request' - so neither hears about the other.

As soon as cocoMDS1,2,3,4,5,6 etc start they all send their 'new registration requests', which then stimulates SUT_Server and CTS_Server to send 're-registration' requests

May be something wrong with the initial cohort setup here....
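To make this check repeatable, here is a small sketch that scans audit-log lines for evidence that a server heard from a peer. It assumes the line layout seen in the logs in this thread (six timestamp tokens, then server name, severity and message id) and treats OMRS-AUDIT-0110 and OMRS-AUDIT-0112 as the "registration received" messages, which is how they read in those logs. The function names are hypothetical.

```python
# Message ids that, in the logs in this thread, report a (re-)registration
# request being received from another cohort member.
REGISTRATION_RECEIVED = {"OMRS-AUDIT-0110", "OMRS-AUDIT-0112"}


def parse_audit_line(line):
    """Split an audit-log line into server, severity, message id and text."""
    parts = line.split(None, 9)
    if len(parts) < 10:
        return None  # not an audit-log line in the assumed layout
    return {"server": parts[6], "severity": parts[7],
            "message_id": parts[8], "text": parts[9]}


def servers_that_heard_a_peer(lines):
    """Return the set of servers that logged an incoming registration."""
    heard = set()
    for line in lines:
        record = parse_audit_line(line)
        if record and record["message_id"] in REGISTRATION_RECEIVED:
            heard.add(record["server"])
    return heard


sample = [
    "Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0062 "
    "Requesting registration information from other members of the open "
    "metadata repository cohort devCohort",
]
print(servers_that_heard_a_peer(sample))
```

An empty set for a run with both servers started is the symptom described here: each server asked for registrations, but neither logged one arriving.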

grahamwallis commented 4 years ago

It would be worth checking the output of the CTS and SUT server logs carefully - if the configurations are consistent, Kafka is available and both servers are joining the devCohort, but failing to register, it could be because of a stale server->metadataCollectionId association. If that's the case it will be logged during server startup.

planetf1 commented 4 years ago

With only CTS active:

  ~ cat /tmp/dev | grep -y Registration
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
➜  ~ cat /tmp/core | grep -y Registration
➜  ~ cat /tmp/core | grep -y Registration
➜  ~ less /tmp/dev
➜  ~ cat /tmp/dev | grep devCohort
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0005 Connecting to cohort devCohort
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0029 The devCohort cohort inbound event manager is initializing
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the devCohort event consumer with the local repository outbound event manager
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Local Repository Content (TypeDef) Manager event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Local Repository Inbound Instance Events event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Cohort to Enterprise event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0031 The devCohort cohort inbound event manager is starting with 2 type definition event consumer(s) and 2 instance event consumer(s)
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0026 Initializing listener for cohort devCohort
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OMRS-AUDIT-0019 The OMRS Topic Connector TopicConnector.Cohort.devCohort has registered with an event bus connector connected to topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:14 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0001 Connecting to Apache Kafka Topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic with a server identifier of 6c873f81-2df7-46aa-a372-6b298d38b9b9
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0003 8 properties passed to the Apache Kafka Consumer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0002 8 properties passed to the Apache Kafka Producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0020 The OMRS Topic Connector TopicConnector.Cohort.devCohort is ready to send and receive events
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0010 The Apache Kafka producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic is starting up with 0 buffered messages
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0015 The listener thread for an OMRS Topic Connector for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic has started
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0060 Registering with open metadata repository cohort devCohort using metadata collection id 165d260d-b84c-4b6c-9ccb-7aac67b10885
Mon Jan 20 13:13:15 GMT 2020 CTS_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0005 Connecting to cohort devCohort
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0029 The devCohort cohort inbound event manager is initializing
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the devCohort event consumer with the local repository outbound event manager
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the Local Repository Content (TypeDef) Manager event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the Local Repository Inbound Instance Events event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0031 The devCohort cohort inbound event manager is starting with 1 type definition event consumer(s) and 1 instance event consumer(s)
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0026 Initializing listener for cohort devCohort
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0019 The OMRS Topic Connector TopicConnector.Cohort.devCohort has registered with an event bus connector connected to topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0001 Connecting to Apache Kafka Topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic with a server identifier of 4014ceea-32b4-4521-a6be-565e05de2e42
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0003 8 properties passed to the Apache Kafka Consumer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0002 8 properties passed to the Apache Kafka Producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0020 The OMRS Topic Connector TopicConnector.Cohort.devCohort is ready to send and receive events
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0015 The listener thread for an OMRS Topic Connector for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic has started
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0060 Registering with open metadata repository cohort devCohort using metadata collection id 8bd98b91-93ed-4dd6-a5c5-ae041f689572
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0010 The Apache Kafka producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic is starting up with 0 buffered messages
Mon Jan 20 13:13:18 GMT 2020 SUT_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
planetf1 commented 4 years ago

What is very odd, is that when run locally we do see new registration requests ie:

➜  tutorials git:(master) ✗ cat ~/log | grep devCohort
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0005 Connecting to cohort devCohort
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0029 The devCohort cohort inbound event manager is initializing
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the devCohort event consumer with the local repository outbound event manager
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Local Repository Content (TypeDef) Manager event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Local Repository Inbound Instance Events event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0030 Registering the Cohort to Enterprise event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0031 The devCohort cohort inbound event manager is starting with 2 type definition event consumer(s) and 2 instance event consumer(s)
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0026 Initializing listener for cohort devCohort
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0019 The OMRS Topic Connector TopicConnector.Cohort.devCohort has registered with an event bus connector connected to topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0001 Connecting to Apache Kafka Topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic with a server identifier of 3761e775-9b0c-4d0b-9ad9-042a106e30a7
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0003 8 properties passed to the Apache Kafka Consumer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0002 8 properties passed to the Apache Kafka Producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0010 The Apache Kafka producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic is starting up with 0 buffered messages
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0020 The OMRS Topic Connector TopicConnector.Cohort.devCohort is ready to send and receive events
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0015 The listener thread for an OMRS Topic Connector for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic has started
Mon Jan 20 13:42:25 GMT 2020 CTS_Server Information OMRS-AUDIT-0060 Registering with open metadata repository cohort devCohort using metadata collection id bf3daf54-596c-4904-97fe-3fffff9c4456
Mon Jan 20 13:42:26 GMT 2020 CTS_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0005 Connecting to cohort devCohort
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0029 The devCohort cohort inbound event manager is initializing
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the devCohort event consumer with the local repository outbound event manager
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the Local Repository Content (TypeDef) Manager event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0030 Registering the Local Repository Inbound Instance Events event consumer with the devCohort cohort inbound event manager
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0031 The devCohort cohort inbound event manager is starting with 1 type definition event consumer(s) and 1 instance event consumer(s)
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0026 Initializing listener for cohort devCohort
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0019 The OMRS Topic Connector TopicConnector.Cohort.devCohort has registered with an event bus connector connected to topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0001 Connecting to Apache Kafka Topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic with a server identifier of 63768387-724c-454c-8a4e-980bf5f9aec4
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0003 8 properties passed to the Apache Kafka Consumer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0002 8 properties passed to the Apache Kafka Producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OCF-KAFKA-TOPIC-CONNECTOR-0010 The Apache Kafka producer for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic is starting up with 0 buffered messages
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0015 The listener thread for an OMRS Topic Connector for topic egeria.omag.openmetadata.repositoryservices.cohort.devCohort.OMRSTopic has started
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0020 The OMRS Topic Connector TopicConnector.Cohort.devCohort is ready to send and receive events
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0060 Registering with open metadata repository cohort devCohort using metadata collection id 163d9a2d-850f-4ee5-a866-51d202ddbb9f
Mon Jan 20 13:42:30 GMT 2020 SUT_Server Information OMRS-AUDIT-0062 Requesting registration information from other members of the open metadata repository cohort devCohort
Mon Jan 20 13:42:31 GMT 2020 CTS_Server Information OMRS-AUDIT-0110 A new registration request has been received for cohort devCohort from server SUT_Server that hosts metadata collection 163d9a2d-850f-4ee5-a866-51d202ddbb9f
Mon Jan 20 13:42:32 GMT 2020 CTS_Server Information OMRS-AUDIT-0106 Refreshing registration with open metadata repository cohort devCohort using metadata collection id bf3daf54-596c-4904-97fe-3fffff9c4456 at the request of server SUT_Server
Mon Jan 20 13:42:34 GMT 2020 SUT_Server Information OMRS-AUDIT-0112 A re-registration request has been received for cohort devCohort from server CTS_Server that hosts metadata collection bf3daf54-596c-4904-97fe-3fffff9c4456
grahamwallis commented 4 years ago

Surprising that even in the second (local) example there is no evidence of the repository-workbench starting.

planetf1 commented 4 years ago

It does start -- the tests run. Easy to confirm - computer gets hot and noisy ... ;-)

planetf1 commented 4 years ago

Ok so if, in one of the previously failing environments, one starts the CTS server, then pauses 10-20s before starting SUT, the conformance tests work fine.

This ensures that we don't try and start SUT until after the CTS has sent its registration AND is listening

I note that in the containers

So we could be in fact not looking at a cts specific issue, but rather a general timing issue with kafka. Whilst it could be a configuration problem, the kafka setup is using the bitnami configuration - as used by many people/services in cloud...

grahamwallis commented 4 years ago

Are you running more servers than just the CTS and SUT? Most (all?) test runs tend to be just those two. If there are more servers joining and registering maybe we need to allow extra time?

planetf1 commented 4 years ago

Just the opposite.

If there are other servers running we don't seem to see this problem.

One thought that's crossed my mind from looking at the logs. When running 'clean' the OMRS topic doesn't exist. So the first time we touch it will be when we try and activate SUTServer. It takes a small amount of time before it exists. Very small, but enough to change the timing of any listeners - who can't listen until the topic exists? I think kafka delays briefly in this case.

We wouldn't see this locally since once the topic is created it will just stay there

Just a hunch..

planetf1 commented 4 years ago

a) If this is to do with topic creation / kafka metadata update a small delay (say 2s) should keep the notebook reliable

b) We should try and understand the dynamics, in case it affects a production egeria environment. However if it's proven to be topic creation related it could be fair to assume that isn't a typical production environment so we're ok

c) We still need to document the topics we require clearly for deployers ( I did attempt this at https://github.com/odpi/egeria/blob/e674ec973fc5a475348edf500c763cd7bd62054a/open-metadata-implementation/adapters/open-connectors/event-bus-connectors/open-metadata-topic-connectors/kafka-open-metadata-topic-connector/README.md but it's not that consumable. And whilst we have an API to list topics, that's a bit too late...)
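For (c), the cohort topic names can at least be derived from the cohort name. The sketch below simply reproduces the pattern visible in the logs earlier in this thread; the prefix is copied from those logs, the function name is hypothetical, and other Egeria topics are not covered.

```python
def cohort_topic_name(cohort_name,
                      prefix="egeria.omag.openmetadata.repositoryservices.cohort"):
    """Build the OMRS cohort topic name in the form seen in the audit logs."""
    return f"{prefix}.{cohort_name}.OMRSTopic"


# Cohort names mentioned in this thread.
for cohort in ("devCohort", "cocoCohort"):
    print(cohort_topic_name(cohort))
```

A deployer could use a list like this to create the topics ahead of time, which would also take topic auto-creation out of the timing picture.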

mandy-chessell commented 4 years ago

It could be related to the problem we have with the topic connector losing the registration event if kafka is down. In a busy cluster, registration refresh requests abound and one lost registration event is no problem. In the CTS test case, if the registration events are lost then neither talks to the other.

planetf1 commented 4 years ago

Yes it could.. and in the coco pharma case it's busy so not a problem

I hate to suggest this, but pragmatically a 2s pause seems to work reliably. So I propose to make it 4s (individuals' PCs may vary in speed). This enables others to use the CTS without trouble (I hope)

but we need to understand it and we should revert once we've addressed the underlying issue -- need to get back to the kafka issue @wbittles #1876 - this one I think may be different though - but in the same area, needing a better understanding of kafka's behaviour (and ours). I don't believe the kafka client gives an exception, it will just pause slightly. That may not be what we expect. In any case we need to examine.

I'll raise a PR for the temporary mitigation and retitle this issue accordingly.
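Roughly, the temporary mitigation amounts to the following. This is a sketch only: `start_cts` and `start_sut` are hypothetical callables standing in for whatever the notebook does to activate each server, and the 4s default is the value proposed above.

```python
import time


def start_cts_then_sut(start_cts, start_sut, pause_seconds=4.0,
                       sleep=time.sleep):
    """Start the CTS server, wait, then start the SUT server.

    The pause gives the CTS server time to send its registration and begin
    listening before the SUT server joins the cohort.
    """
    start_cts()
    sleep(pause_seconds)
    start_sut()


# Stand-in callables so the sketch runs without an Egeria platform.
events = []
start_cts_then_sut(lambda: events.append("CTS started"),
                   lambda: events.append("SUT started"),
                   pause_seconds=0.0)
print(events)
```

A fixed sleep hides the timing problem rather than fixing it, which is why it is intended to be reverted once the underlying behaviour is understood.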

wbittles commented 4 years ago

@planetf1 There are a lot of questions raised by this issue so I agree we need to fully understand what's required and what's actually going on. Some of the questions I have from just this issue:

1. Was there an actual message that was too large to send? I would expect this to be constant in all environments. If there is no message, it implies something else is spamming that TCP port. For instance I've seen this caused by posting an HTTP request to a TCP/IP listener. Is Python involved in this situation?

2. I noticed that the actual original exception was thrown by a receive call, and then it closes the TCP/IP connection. My understanding of this version of the code is that the producer (if real) would be in an infinite loop trying to resend, but now getting a different original exception, one from the transport layer. I would expect to see the logs of the producer filled by these retries (1 record for every 20 attempts)

If you can send me the Egeria and Kafka debug logs I can try and work out what's happening at the kafka level.
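One quick way to probe the "something else is hitting that port" theory is to read the reported size as raw bytes. A Kafka broker reads the first four bytes of each request as a big-endian length, so if a non-Kafka client connects, those four bytes are whatever text it sent. The sketch below decodes the size from the InvalidReceiveException earlier in this thread; the function name is hypothetical.

```python
def size_field_as_text(size):
    """Interpret a 4-byte big-endian length prefix as ASCII text."""
    raw = size.to_bytes(4, byteorder="big")
    return raw.decode("ascii", errors="replace")


# Size reported in the broker's "Invalid receive" warning.
print(size_field_as_text(1212498244))  # prints "HEAD"
```

The bytes spell "HEAD", which is consistent with an HTTP HEAD request arriving on the Kafka listener rather than a genuine 1.2GB message. It does not identify the sender, so the debug logs would still be needed to confirm.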

planetf1 commented 4 years ago

Agreed in discussion that #2119 will aim to redesign the interaction with kafka in a cleaner, more resilient way, and in addition #2475 will add an extra level of checking in case registration is lost

As such, closing.