masu-listener unable to recover from CoordinatorNotAvailableError

dccurtis commented 5 years ago

The masu-listener in the hccm-qa environment experienced the following error:

Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError

After we reached out to the team that manages Kafka for the platform, it was determined that this was a broker problem.

The listener was unable to recover from this event and logged the following errors:

Unable connect to node with id 2: [Errno 111] Connect call failed ('10.128.24.26', 9092)
Unable connect to node with id 1: [Errno 111] Connect call failed ('10.131.1.225', 9092)

Restarting the pod was required to reconnect to the upload service.

Desired enhancements for this issue:

Gracefully recover from this error so that no restart is required in the future.
Report an appropriate Prometheus status when we are no longer connected to the upload service, so the operator can detect and correct the problem more easily.
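
One possible shape for both enhancements, as a minimal sketch rather than the listener's actual code: retry the connection with exponential backoff and keep a connection status that a metrics exporter could read. Every name here (`ConnectionStatus`, `connect_with_backoff`, `connect`, `retry_on`) is hypothetical. The sketch retries only on `OSError` (which covers the `[Errno 111]` connection refusals above); the Kafka client's own error classes would need to be added to `retry_on`.

```python
import asyncio
import random


class ConnectionStatus:
    """Hypothetical stand-in for a metrics gauge: 1 = connected, 0 = not."""

    def __init__(self):
        self.connected = 0
        self.failed_attempts = 0


async def connect_with_backoff(connect, status, base_delay=1.0, max_delay=60.0,
                               max_attempts=None, retry_on=(OSError,),
                               sleep=asyncio.sleep):
    """Await `connect()` until it succeeds, backing off between attempts.

    `connect` is any coroutine function that raises on failure (assumption:
    the real listener would pass whatever starts its Kafka consumer).
    """
    attempt = 0
    while True:
        try:
            result = await connect()
        except retry_on:
            attempt += 1
            status.connected = 0
            status.failed_attempts += 1
            # Give up only if the caller set a limit; otherwise retry forever.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Exponential backoff capped at max_delay, plus up to 10% jitter
            # so several listeners do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay / 10)
            await sleep(delay)
        else:
            status.connected = 1
            return result
```

In a real deployment the `ConnectionStatus` attributes could be replaced by a `Gauge` from the `prometheus_client` package, so that an alert can fire when the value stays at 0.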

adberglund commented 5 years ago

A) Logs sent up
B) Some sort of alert that things are not right
C) A way to restart the pod to try to get back to healthy

lcouzens commented 5 years ago

Verified Commit: "d5d24666043e6e6bdb6964ac7e093f523f9a6b86"

project-koku / masu

masu-listener unable to recover from CoordinatorNotAvailableError #475