Agent hangs if it doesn't start with the right permissions

mwringe commented 7 years ago

When the agent is started without the proper permissions, it logs the following error and hangs:

1 node_event_consumer.go:72] Error obtaining information about the agent pod [openshift-infra/hawkular-openshift-agent-qzg21]. err=User "system:serviceaccount:openshift-infra:hawkular-openshift-agent" cannot get pods in project "openshift-infra"

If the service account (SA) is then given the proper permissions, the pod still hangs. If the pod is restarted, it starts up properly.

By hanging like this, it's left in a position where it's indicating that it's ready and running properly (status 1/1). At the very least, if it cannot properly continue, it should exit so that a new pod can be started in its place.

In this case, I believe the agent should wait and attempt to connect a few more times after some delay. We could even use a 'readiness probe' here to determine when the agent reaches a ready state.

mwringe commented 7 years ago

We should also consider the situation where the permission is revoked while the agent is running, and how that should be handled.

jmazzitelli commented 7 years ago

We have a health probe endpoint - when errors like this occur, we can flip these values and the health probe will see a problem and act appropriately (shutdown / restart the agent?): https://github.com/hawkular/hawkular-openshift-agent/blob/master/emitter/health/health_emitter.go

mwringe commented 7 years ago

I think a more appropriate action would be to have the pod stay in the 'not ready' phase until the permission is granted. We can use a readiness probe for that. The logs should clearly indicate what the problem is and how to fix it. And once the permission has been granted the pod can continue and enter the ready state.

I don't think we want to restart the pod in this case. That will cause a CrashLoopBackOff, which makes it look like our pods are really unstable and crashing. It is also portrayed to the user more as an error condition.

I think the same thing should be done if the permission is revoked. We shouldn't restart the pod in that case, but rather log the error and continuously check whether the permission has been re-granted. In that case the pod should also exit the ready state.

hawkular / hawkular-openshift-agent

Agent hangs if it doesn't start with the right permissions #111