harshad16 closed this issue 2 years ago
/triage accepted
/priority critical-urgent
/sig devsecops
2022-05-25 12:59:27,761 ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
org.apache.kafka.common.KafkaException: Failed to acquire lock on file .lock in /var/lib/kafka/data/kafka-log0. A Kafka instance in another process or thread is using this directory.
The AMQ Streams deployment has been updated. One of the GitHub threads suggested that CR updates sometimes cause such issues; we need to check what the breaking changes were in this version.
The reason for the failure is documented here: https://access.redhat.com/documentation/en-us/red_hat_amq_streams/2.1/html/deploying_and_upgrading_amq_streams_on_openshift/assembly-upgrade-str#assembly-upgrading-kafka-versions-str
| Configuration changed | Rolling updates required |
| -- | -- |
| Both the `inter.broker.protocol.version` and the `log.message.format.version` | A single rolling update. After the update, the `inter.broker.protocol.version` must be updated manually, followed by `log.message.format.version`. Changing each will trigger a further rolling update. |
It is stated that the log message format version and the inter-broker protocol version need to be updated manually.
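For reference, these manual bumps happen in the `Kafka` custom resource managed by the operator. A minimal sketch of the relevant fields, assuming the 3.0.0 → 3.1.0 upgrade from the logs below (resource name and namespace taken from the reconciliation logs; exact values should be checked against the AMQ Streams docs):

```yaml
# Sketch of the relevant part of the Kafka CR (Strimzi / AMQ Streams).
# After spec.kafka.version is upgraded, inter.broker.protocol.version and
# then log.message.format.version must be bumped manually; each change
# triggers a further rolling update of the brokers.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: thoth
  namespace: aicoe
spec:
  kafka:
    version: 3.1.0
    config:
      inter.broker.protocol.version: "3.1"
      log.message.format.version: "3.1"
```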
The upgrade is in progress:
2022-05-25 19:11:46 INFO ClusterOperator:123 - Triggering periodic reconciliation for namespace aicoe
2022-05-25 19:11:46 INFO AbstractOperator:226 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka thoth will be checked for creation or modification
2022-05-25 19:11:46 WARN AbstractConfiguration:137 - Reconciliation #360(timer) Kafka(aicoe/thoth): Configuration option "ssl.endpoint.identification.algorithm" is forbidden and will be ignored
2022-05-25 19:11:46 INFO KafkaAssemblyOperator:1038 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka is upgrading from 3.0.0 to 3.1.0
2022-05-25 19:11:46 INFO KafkaAssemblyOperator:975 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka upgrade from 3.0.0 to 3.1.0 requires no change in Zookeeper version
2022-05-25 19:11:47 INFO ZookeeperLeaderFinder:209 - Reconciliation #360(timer) Kafka(aicoe/thoth): Pod thoth-zookeeper-0 is not a leader
2022-05-25 19:11:47 INFO ZookeeperLeaderFinder:206 - Reconciliation #360(timer) Kafka(aicoe/thoth): Pod thoth-zookeeper-2 is leader
2022-05-25 19:11:47 INFO ZooKeeperRoller:133 - Reconciliation #360(timer) Kafka(aicoe/thoth): Rolling pod thoth-zookeeper-0 due to [Pod has old generation]
2022-05-25 19:11:47 INFO PodOperator:54 - Reconciliation #360(timer) Kafka(aicoe/thoth): Rolling pod thoth-zookeeper-0
2022-05-25 19:11:47 INFO OperatorWatcher:38 - Reconciliation #361(watch) Kafka(aicoe/thoth): Kafka thoth in namespace aicoe was MODIFIED
The deployment is fixed: https://console-openshift-console.apps.ocp4.prod.psi.redhat.com/k8s/ns/aicoe/pods
The investigator is not able to resolve some of the workloads, due to:
%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.adviser-trigger [0]: offset reset (at offset 549) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.investigator.unresolved-package [0]: offset reset (at offset 1346844) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.investigator.unrevsolved-package [0]: offset reset (at offset 1249710) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
It seems the offsets of some topics were lost during the update. This likely happened because the exact upgrade steps were not followed.
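The "Offset out of range" warnings above come from librdkafka's `auto.offset.reset` handling: when a consumer requests an offset the broker no longer retains, the consumer's position is reset according to the configured policy. A hypothetical simulation of that decision (not the actual library code; function name and signature are illustrative):

```python
def resolve_offset(requested: int, log_start: int, log_end: int,
                   policy: str = "latest") -> int:
    """Return the offset a consumer will actually use after a fetch.

    Simulates the auto.offset.reset fallback: if the requested offset is
    still within the broker's retained range, it is used as-is; otherwise
    the consumer jumps to the start or end of the log per the policy.
    """
    if log_start <= requested <= log_end:
        return requested  # offset still available on the broker
    # Offset out of range: fall back to the configured reset policy
    if policy == "earliest":
        return log_start  # re-read everything still retained
    return log_end        # "latest": jump to END, skipping messages


# Mirrors the log line: the consumer asks for offset 1346844, but the
# topic's log was truncated/recreated during the upgrade, so it is
# reset to END and the unconsumed messages are skipped.
print(resolve_offset(1346844, 0, 512))  # → 512
```

Resetting to END explains why the investigator silently lost messages rather than crashing: the consumer kept running but skipped everything between its stored offset and the new end of the log.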
Now this is working; closing this issue.
**Describe the bug**
The team is facing an issue where requesting an action on the stage cluster doesn't create any workflows. On further investigation, it was noticed that the Kafka messages are not getting consumed, due to an issue in the Kafka deployment. On stage, our own team manages our own Kafka deployment here.
The Kafka worker nodes are failing; investigate the issue further and fix the pipeline.