The masu-listener in the hccm-qa environment experienced the following error:
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
After reaching out to the team that manages Kafka for the platform, it was determined that this was a broker problem.
The listener was unable to recover from this event and logged the following errors:
Unable connect to node with id 2: [Errno 111] Connect call failed ('10.128.24.26', 9092)
Unable connect to node with id 1: [Errno 111] Connect call failed ('10.131.1.225', 9092)
Restarting the pod was required to reconnect to the upload service.
Desired enhancements for this issue:
Gracefully recover from this error so that no restart is required in the future.
Report an appropriate Prometheus status when the listener is no longer connected to the upload service, so the operator can detect and correct the problem more easily.
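One possible shape for both enhancements is a reconnect loop with exponential backoff that reports its connection state on every attempt. This is only a sketch using the standard library, not the listener's actual code: `connect_with_retry`, `connect`, and `set_status` are hypothetical names. `connect` stands in for whatever coroutine starts the Kafka consumer, and `set_status` stands in for a callback that would update a Prometheus gauge (1 for connected, 0 for disconnected).

```python
import asyncio


async def connect_with_retry(connect, set_status, base_delay=1.0,
                             max_delay=60.0, max_attempts=None):
    """Await `connect` until it succeeds, backing off between attempts.

    `connect` (hypothetical) is a coroutine function that raises OSError
    on failure. `set_status` (hypothetical) receives 1 when connected and
    0 when not, e.g. to update a metrics gauge.
    """
    attempt = 0
    delay = base_delay
    while True:
        attempt += 1
        try:
            result = await connect()
        except OSError:
            # ConnectionRefusedError ([Errno 111]) is a subclass of OSError.
            set_status(0)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            await asyncio.sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
        else:
            set_status(1)
            return result
```

With `max_attempts=None` the loop keeps retrying until the broker comes back, which avoids the manual pod restart. The status callback lets an operator alert on a gauge that stays at 0.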