Closed: jfuechsl closed this issue 3 years ago.
One strategy we tried to circumvent the problem is the following: a RollingUpdate strategy with maxSurge == 0 and maxUnavailable == 20% (for example).
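In Deployment terms that corresponds to the `spec.strategy` settings below. This is only a minimal sketch using the official `kubernetes` Python client; the deployment name and namespace are placeholders, not our actual app.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Patch the rollout strategy of an example deployment (names are placeholders).
apps.patch_namespaced_deployment(
    name="example-app-web",
    namespace="example-app",
    body={
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": 0,           # never run more pods than requested
                    "maxUnavailable": "20%"  # replace at most 20% of pods at a time
                },
            }
        }
    },
)
```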
@jfuechsl Is there a bigger problem if the scheduling failure doesn't get fixed by autoscaling and you have DEIS_IGNORE_SCHEDULING_FAILURE set? We have logic in our deployments to retry the deployment when there are scheduling failures - I wonder if workflow could optionally do this rather than just ignoring the failures.
@dmcnaught good question. In that case I suppose the deployment would time out eventually and revert. Retrying would not be desired in such cases as it should just wait for K8s to finish the deployment rollout and have enough time to schedule everything properly.
@jfuechsl Right - it wouldn't really be retrying the deploy, but retrying the check to see whether it was successful or still has a scheduling error. Once enough time has passed to time out (something like DEIS_IGNORE_SCHEDULING_TIMEOUT), it can return an accurate pass or fail.
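Conceptually something like this sketch; `has_scheduling_error` is a hypothetical stand-in for whatever check the controller actually runs, and the env var name is just the one suggested above.

```python
import os
import time

def check_with_timeout(has_scheduling_error, interval=10):
    """Re-run the scheduling-error check until it passes or the timeout expires,
    instead of failing the deploy on the first scheduling error it sees."""
    timeout = int(os.environ.get("DEIS_IGNORE_SCHEDULING_TIMEOUT", "600"))
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not has_scheduling_error():
            return True   # rollout recovered, e.g. the autoscaler added capacity
        time.sleep(interval)
    return False          # still failing after the timeout: an accurate failure
```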
@dmcnaught Great idea, let me update the PR.
One problem with that is that we are inspecting a pod's event stream to look for failures such as FailedScheduling. Once that event has happened, it will always be included in the events, which makes retrying the check on a timeout impossible.
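For reference, the event-based check amounts to something like this (a sketch with the `kubernetes` Python client, not the actual controller code):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def has_failed_scheduling_event(namespace, pod_name):
    # Events are append-only: once a FailedScheduling event exists for the pod,
    # this keeps returning True even if the pod is scheduled successfully later.
    events = core.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name},reason=FailedScheduling",
    )
    return len(events.items) > 0
```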
IMO, we shouldn't look at the pod events at all, but only at the Pod phase and Pod conditions. They represent the point-in-time status of the Pod and can thus be handled with timeout-based waiting logic on a per-Pod basis. That would, however, necessitate a bigger change in a Deployment's wait_until_ready logic.
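A phase/condition-based check could look roughly like this (again only a sketch using the `kubernetes` Python client; the real change would have to go into wait_until_ready):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def pod_is_ready(namespace, pod_name):
    # Point-in-time view: the phase plus the Ready condition, no event history.
    pod = core.read_namespaced_pod(name=pod_name, namespace=namespace)
    if pod.status.phase != "Running":
        return False
    for cond in (pod.status.conditions or []):
        if cond.type == "Ready" and cond.status == "True":
            return True
    return False
```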
Thank you for the great discussion @jfuechsl and @dmcnaught.
Agreed that it would be nice to look at Pod phases and conditions rather than scheduling events on the k8s API. However, that would require a bit more work. As of right now, this little feature hack saves some headaches on autoscaling clusters and doesn't change the existing behaviour for users, so it is good IMO.
@jfuechsl can you document the new ENV variable in the workflow docs? I added a note on the PR.
@Cryptophobia thanks very much. Yes, I will update the docs and submit a PR.
@Cryptophobia I submitted https://github.com/teamhephy/workflow/pull/108
All relevant PRs for this feature were merged a while ago. Closing now. Thank you @jfuechsl.
We are seeing the following issue during app deployments (git push or deis config:set, etc.) periodically, especially when the app has many pods and the cluster is highly utilized in terms of resource requests.