thoth-station / thoth-application

Thoth-Station ArgoCD Applications
GNU General Public License v3.0

workflows in stage are not getting scheduled due to issue in kafka #2566

Closed harshad16 closed 2 years ago

harshad16 commented 2 years ago

Describe the bug

The team is facing an issue where requesting an action on the stage cluster doesn't create any workflows. On further investigation, we noticed that the Kafka messages are not being consumed because of a problem in the Kafka deployment. On stage, our team manages its own Kafka deployment.

The Kafka worker nodes are failing. Investigate the issue further and fix the pipeline.

harshad16 commented 2 years ago

/triage accepted
/priority critical-urgent
/sig devsecops

harshad16 commented 2 years ago

2022-05-25 12:59:27,761 ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) [main]
org.apache.kafka.common.KafkaException: Failed to acquire lock on file .lock in /var/lib/kafka/data/kafka-log0. A Kafka instance in another process or thread is using this directory.

harshad16 commented 2 years ago

AMQ Streams has been updated. One of the GitHub threads suggested that CR updates sometimes cause such issues. We need to check what the breaking changes were in this version.

harshad16 commented 2 years ago

The reason for the failure is described in the AMQ Streams upgrade documentation: https://access.redhat.com/documentation/en-us/red_hat_amq_streams/2.1/html/deploying_and_upgrading_amq_streams_on_openshift/assembly-upgrade-str#assembly-upgrading-kafka-versions-str


The relevant table row in that document pairs the case "Both the inter.broker.protocol.version and the log.message.format.version." with the behavior: "A single rolling update. After the update, the inter.broker.protocol.version must be updated manually, followed by log.message.format.version. Changing each will trigger a further rolling update."

It states that the protocol version and the log format version need to be updated manually, in that order.
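
To make that manual follow-up concrete, here is a minimal sketch of the two patches, in the order the documentation gives. It assumes the options live under spec.kafka.config of the Kafka custom resource and uses "3.1" as a placeholder target version. The helper names are hypothetical and not part of any AMQ Streams tooling.

```python
import json


def config_patch(key, version):
    """Build a JSON merge patch that sets one option under spec.kafka.config."""
    return {"spec": {"kafka": {"config": {key: version}}}}


def manual_upgrade_patches(target):
    """Return the two patches in the documented order.

    Assumption: `target` is the version string for the new Kafka release
    (for example "3.1"); confirm the exact value before applying anything.
    """
    return [
        # Step 1: raise the inter-broker protocol version first.
        config_patch("inter.broker.protocol.version", target),
        # Step 2: only afterwards raise the message format version.
        config_patch("log.message.format.version", target),
    ]


if __name__ == "__main__":
    for step, patch in enumerate(manual_upgrade_patches("3.1"), start=1):
        print(f"step {step}: {json.dumps(patch)}")
```

Each printed patch could then be applied as a merge patch to the Kafka resource, for example with `oc patch ... --type merge -p '<json>'`. Since each change triggers a further rolling update, the second patch should wait until the first rollout has finished.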

harshad16 commented 2 years ago

Upgrading is happening:

2022-05-25 19:11:46 INFO  ClusterOperator:123 - Triggering periodic reconciliation for namespace aicoe
2022-05-25 19:11:46 INFO  AbstractOperator:226 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka thoth will be checked for creation or modification
2022-05-25 19:11:46 WARN  AbstractConfiguration:137 - Reconciliation #360(timer) Kafka(aicoe/thoth): Configuration option "ssl.endpoint.identification.algorithm" is forbidden and will be ignored
2022-05-25 19:11:46 INFO  KafkaAssemblyOperator:1038 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka is upgrading from 3.0.0 to 3.1.0
2022-05-25 19:11:46 INFO  KafkaAssemblyOperator:975 - Reconciliation #360(timer) Kafka(aicoe/thoth): Kafka upgrade from 3.0.0 to 3.1.0 requires no change in Zookeeper version
2022-05-25 19:11:47 INFO  ZookeeperLeaderFinder:209 - Reconciliation #360(timer) Kafka(aicoe/thoth): Pod thoth-zookeeper-0 is not a leader
2022-05-25 19:11:47 INFO  ZookeeperLeaderFinder:206 - Reconciliation #360(timer) Kafka(aicoe/thoth): Pod thoth-zookeeper-2 is leader
2022-05-25 19:11:47 INFO  ZooKeeperRoller:133 - Reconciliation #360(timer) Kafka(aicoe/thoth): Rolling pod thoth-zookeeper-0 due to [Pod has old generation]
2022-05-25 19:11:47 INFO  PodOperator:54 - Reconciliation #360(timer) Kafka(aicoe/thoth): Rolling pod thoth-zookeeper-0
2022-05-25 19:11:47 INFO  OperatorWatcher:38 - Reconciliation #361(watch) Kafka(aicoe/thoth): Kafka thoth in namespace aicoe was MODIFIED

harshad16 commented 2 years ago

The deployment is fixed: https://console-openshift-console.apps.ocp4.prod.psi.redhat.com/k8s/ns/aicoe/pods

harshad16 commented 2 years ago

The investigator is not able to resolve some of the workloads, due to:

%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.adviser-trigger [0]: offset reset (at offset 549) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.investigator.unresolved-package [0]: offset reset (at offset 1346844) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range
%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: ocp4-stage.thoth.investigator.unrevsolved-package [0]: offset reset (at offset 1249710) to END: fetch failed due to requested offset not available on the broker: Broker: Offset out of range

It seems that the offsets of some topics were disturbed during the update. This probably happened because the exact upgrade steps were not followed.
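
To see which topics and partitions were affected, the consumer log can be scanned for these OFFSET lines. The sketch below is a rough helper, not a general librdkafka log parser: the regular expression is an assumption fitted to the three lines above, and the names `OffsetReset` and `parse_offset_reset` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class OffsetReset(NamedTuple):
    topic: str
    partition: int
    offset: int  # the stored offset that was no longer available on the broker


# Assumed line shape, based only on the log excerpt in this thread:
#   %4|<ts>|OFFSET|<client>| [thrd:main]: <topic> [<partition>]: offset reset (at offset <n>) ...
OFFSET_RESET = re.compile(
    r"\|OFFSET\|.*?\]: (?P<topic>\S+) \[(?P<partition>\d+)\]: "
    r"offset reset \(at offset (?P<offset>\d+)\)"
)


def parse_offset_reset(line: str) -> Optional[OffsetReset]:
    """Return the topic, partition and stale offset, or None if the line does not match."""
    match = OFFSET_RESET.search(line)
    if match is None:
        return None
    return OffsetReset(match["topic"], int(match["partition"]), int(match["offset"]))


if __name__ == "__main__":
    sample = (
        "%4|1653679087.256|OFFSET|rdkafka#consumer-2| [thrd:main]: "
        "ocp4-stage.thoth.adviser-trigger [0]: offset reset (at offset 549) to END: "
        "fetch failed due to requested offset not available on the broker: "
        "Broker: Offset out of range"
    )
    print(parse_offset_reset(sample))
```

Because the resets go "to END", the consumer resumes from the newest message on each listed partition. Requests that were pending at the old offsets would need to be submitted again.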

harshad16 commented 2 years ago

This is working now. Closing this.