odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

Unable to run egeria demo on first startup, Kafka issue #3395

Closed CDaRip2U closed 4 years ago

CDaRip2U commented 4 years ago

Trying to start up the demo today, I ran into a new issue with Kafka.

"The resulting exception of org.odpi.openmetadata.frameworks.connectors.ffdc.ConnectorCheckedException included the following message: OCF-KAFKA-TOPIC-CONNECTOR-400-002 Egeria was unable to initialize a connection to a Kafka cluster. The message in the exception was: Failed to create new KafkaAdminClient"

I am running with egeria:2.2-snapshot. Pressing CTRL-C, stopping the services and restarting seems to have addressed the issue, but the failure seems to happen consistently when starting from a clean environment.

CDaRip2U commented 4 years ago

I am on Windows 10, build 2004, with Docker using WSL2.

CONTAINER ID  IMAGE                      COMMAND                  CREATED         STATUS        PORTS                                                   NAMES
910620944ef2  odpi/jupyter:2.2-SNAPSHOT  "tini -g -- start.sh…"   23 minutes ago  Up 8 minutes  0.0.0.0:18888->8888/tcp                                 tutorials_notebook_1
4b9fe49d1336  odpi/egeria:2.2-SNAPSHOT   "/entrypoint.sh /bin…"   23 minutes ago  Up 8 minutes  5005/tcp, 9443/tcp, 0.0.0.0:18443->8443/tcp             tutorials_ui_1
5682a8009c9c  odpi/egeria:2.2-SNAPSHOT   "/entrypoint.sh /bin…"   23 minutes ago  Up 8 minutes  5005/tcp, 0.0.0.0:19444->9443/tcp                       tutorials_datalake_1
895393648a04  odpi/egeria:2.2-SNAPSHOT   "/entrypoint.sh /bin…"   23 minutes ago  Up 8 minutes  5005/tcp, 0.0.0.0:19445->9443/tcp                       tutorials_dev_1
9de410f84180  odpi/egeria:2.2-SNAPSHOT   "/entrypoint.sh /bin…"   23 minutes ago  Up 8 minutes  5005/tcp, 0.0.0.0:19446->9443/tcp                       tutorials_factory_1
129edb6301f0  odpi/egeria:2.2-SNAPSHOT   "/entrypoint.sh /bin…"   23 minutes ago  Up 8 minutes  5005/tcp, 0.0.0.0:19443->9443/tcp                       tutorials_core_1
14170df1889a  bitnami/kafka:latest       "/opt/bitnami/script…"   23 minutes ago  Up 8 minutes  0.0.0.0:19092->9092/tcp                                 tutorials_kafka_1
801050f02516  bitnami/zookeeper:latest   "/opt/bitnami/script…"   24 minutes ago  Up 8 minutes  2888/tcp, 3888/tcp, 8080/tcp, 0.0.0.0:12181->2181/tcp   tutorials_zookeeper_1

As you can see, the environment was created and then had to be restarted.

planetf1 commented 4 years ago

The exception is from code in the Kafka client that is basically just trying to check that Kafka is up. So is it that Kafka wasn't up?

You mentioned in the second message that something had to be restarted, but it's unclear what: all those containers have been up 8 minutes and were created 23-24 minutes ago.
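To make that comparison concrete, here is a rough sketch (not part of Egeria or Docker) that turns the CREATED and STATUS text from a `docker ps` listing into a gap in seconds. The helper names `duration_seconds` and `restart_gap_seconds` are made up for illustration, and the parsing only assumes the usual human-readable forms such as "23 minutes ago" or "Up 8 minutes"; those values are rounded by Docker, so the result is approximate.

```python
import re

# Seconds per unit for the human-readable durations printed by `docker ps`.
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400, "week": 604800}


def duration_seconds(text):
    """Parse text such as '23 minutes ago', 'Up 8 minutes' or 'About an hour ago'."""
    match = re.search(r"(\d+|an?)\s+(second|minute|hour|day|week)s?", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    # 'a' / 'an' stand for a count of one ("About a minute ago").
    count = 1 if match.group(1) in ("a", "an") else int(match.group(1))
    return count * _UNIT_SECONDS[match.group(2)]


def restart_gap_seconds(created, status):
    """Approximate time between container creation and its most recent start."""
    return duration_seconds(created) - duration_seconds(status)
```

For the listing above, `restart_gap_seconds("23 minutes ago", "Up 8 minutes")` gives roughly 15 minutes for every container, which suggests the whole environment was restarted together rather than a single container.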

I tried the latest 2.2 docker compose and all containers seemed to start ok. I checked the logs from each egeria container and all were clean.

This was on a 16GB linux machine

It's possible it's performance related, i.e. your environment cannot start Kafka quickly enough and one Egeria node stops waiting, or a container is being stopped due to memory constraints.
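If timing is the cause, one way to test the theory is to wait for the Kafka port before starting anything that depends on it. The following is a minimal sketch, not something Egeria ships: `wait_for_port` is a hypothetical helper, and the timeout and interval values are arbitrary. Note that a successful TCP connect only shows the port is accepting connections; it does not prove the broker is fully ready, and with published Docker ports the connect may succeed before the process inside the container is listening.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection; success means the port is reachable.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

With the port mapping shown in the `docker ps` output earlier in this thread, `wait_for_port("localhost", 19092, timeout_s=120)` would probe the host-side Kafka port. Timing how long it takes to return True on a clean start would show whether Kafka is slow to come up in your environment.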

Is the system under high CPU load during startup? How long does startup take? Is there any info in the Docker logs that might indicate a problem? How much RAM (recommended: 6GB) do you have allocated to Docker?

Will try on Windows...

planetf1 commented 4 years ago

I repeated this on a Windows system of 2014 vintage with 16GB RAM, running a prerelease Windows build (20161) with Docker on WSL2, and there too the compose environment started correctly. I didn't observe any audit log events indicating a problem, nor any restarts. I did note very high CPU usage as the Docker containers were starting, and up to 12GB of committed RAM (I realised the RAM allocation recommendation doesn't apply to WSL2), but I think this is down to the switch to WSL2, which looks like it may still be being tweaked.

Can you repro on any other environment? I still suspect timing/resources.

planetf1 commented 4 years ago

I should clarify that in both cases I was able to run the config and start notebooks and have the servers in the cohort configured together. There are many events in the audit logs, but each time the Kafka connector appears to connect on the first attempt across all servers.

I would note the Windows machine is very sluggish, which seems to be primarily due to the large amount of RAM grabbed by WSL2.

planetf1 commented 4 years ago

@CDaRip2U ^^

planetf1 commented 4 years ago

Can you reopen if this re-occurs? I've not seen it and can't reproduce it currently. If you do hit it, I suggest we get together on Slack to talk it through and debug.