project-koku / masu

This is a READ ONLY repo. See https://github.com/project-koku/koku for current masu implementation
GNU Affero General Public License v3.0

masu-listener unable to recover from CoordinatorNotAvailableError #475

Closed dccurtis closed 5 years ago

dccurtis commented 5 years ago

The masu-listener in the hccm-qa environment experienced the following error:

Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError
Group Coordinator Request failed: [Error 15] CoordinatorNotAvailableError

After reaching out to the team that manages Kafka for the platform, it was determined that this was a broker problem.

The listener was unable to recover from this event and logged the following errors:

Unable connect to node with id 2: [Errno 111] Connect call failed ('10.128.24.26', 9092)
Unable connect to node with id 1: [Errno 111] Connect call failed ('10.131.1.225', 9092)

Restarting the pod was required to reconnect to the upload service.

Desired enhancements for this issue:

adberglund commented 5 years ago

A) Logs sent up
B) Some sort of alert that things are not right
C) A way to restart the pod to try to get back to a healthy state
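
One possible shape for item C is sketched below. This is only an illustration, not the fix that was committed to masu. The names `run_with_reconnect` and `listen_once` are hypothetical, and the sketch assumes the listener's main loop can be wrapped in a callable that raises an exception (here `ConnectionError` by default) when the broker is unreachable. It retries with exponential backoff and, after too many consecutive failures, re-raises so the process exits and the platform can restart the pod.

```python
import logging
import time

LOG = logging.getLogger(__name__)


def run_with_reconnect(listen_once, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, retry_on=(ConnectionError,),
                       sleep=time.sleep):
    """Run listen_once(), retrying with exponential backoff on failure.

    listen_once is a hypothetical callable that builds a fresh consumer
    and processes messages until it returns or raises. Exceptions listed
    in retry_on are treated as recoverable (assumption: the caller maps
    its Kafka client's errors onto these types).
    """
    attempt = 0
    while True:
        try:
            listen_once()
            return  # listener finished cleanly
        except retry_on as err:
            attempt += 1
            LOG.error("Kafka connection failed (attempt %d/%d): %s",
                      attempt, max_attempts, err)
            if attempt >= max_attempts:
                # Give up so the process exits non-zero and the
                # orchestrator can restart the pod.
                raise
            # Delays grow 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

Logging each failed attempt at error level also gives a log-based alert something to key on, which touches on items A and B.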

lcouzens commented 5 years ago

Verified Commit: "d5d24666043e6e6bdb6964ac7e093f523f9a6b86"