Closed: jfuechsl closed this issue 3 years ago.
One strategy we tried to circumvent the problem is the following: a RollingUpdate strategy with maxSurge == 0 and maxUnavailable == 20% (for example).
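In Deployment terms that corresponds to the `spec.strategy` settings below. This is only a minimal sketch using the official `kubernetes` Python client; the deployment name and namespace are placeholders, not our actual app.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Patch the rollout strategy of an example deployment (names are placeholders).
apps.patch_namespaced_deployment(
    name="example-app-web",
    namespace="example-app",
    body={
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": 0,           # never run more pods than requested
                    "maxUnavailable": "20%"  # replace at most 20% of pods at a time
                },
            }
        }
    },
)
```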
@jfuechsl Is there a bigger problem if the scheduling failure doesn't get fixed by autoscaling and you have DEIS_IGNORE_SCHEDULING_FAILURE set? We have logic in our deployments to retry the deployment when there are scheduling failures - I wonder if workflow could optionally do this rather than just ignoring the failures.
@dmcnaught good question. In that case I suppose the deployment would time out eventually and revert. Retrying would not be desired in such cases as it should just wait for K8s to finish the deployment rollout and have enough time to schedule everything properly.
@jfuechsl Right - it wouldn't really be retrying the deploy, but retrying the check to see whether it was successful or still has a scheduling error. Once enough time has passed to time out (something like DEIS_IGNORE_SCHEDULING_TIMEOUT), it can return an accurate pass or fail.
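Conceptually something like this sketch; `has_scheduling_error` is a hypothetical stand-in for whatever check the controller actually runs, and the env var name is just the one suggested above.

```python
import os
import time

def check_with_timeout(has_scheduling_error, interval=10):
    """Re-run the scheduling-error check until it passes or the timeout expires,
    instead of failing the deploy on the first scheduling error it sees."""
    timeout = int(os.environ.get("DEIS_IGNORE_SCHEDULING_TIMEOUT", "600"))
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not has_scheduling_error():
            return True   # rollout recovered, e.g. the autoscaler added capacity
        time.sleep(interval)
    return False          # still failing after the timeout: an accurate failure
```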
@dmcnaught Great idea, let me update the PR.
One problem with that is that we are inspecting a pod's event stream to look for failures such as FailedScheduling. Once that event has happened, it will always be included in the events, which makes retrying the check on a timeout impossible.
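For reference, the event-based check amounts to something like this (a sketch with the `kubernetes` Python client, not the actual controller code):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def has_failed_scheduling_event(namespace, pod_name):
    # Events are append-only: once a FailedScheduling event exists for the pod,
    # this keeps returning True even if the pod is scheduled successfully later.
    events = core.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name},reason=FailedScheduling",
    )
    return len(events.items) > 0
```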
IMO, we shouldn't look at the pod events at all, but only at the Pod phase and Pod conditions. They represent the point-in-time status of the Pod and can thus be handled with timeout-based waiting logic on a per-Pod basis. That would, however, necessitate a bigger change in a Deployment's wait_until_ready logic.
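A phase/condition-based check could look roughly like this (again only a sketch using the `kubernetes` Python client; the real change would have to go into wait_until_ready):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def pod_is_ready(namespace, pod_name):
    # Point-in-time view: the phase plus the Ready condition, no event history.
    pod = core.read_namespaced_pod(name=pod_name, namespace=namespace)
    if pod.status.phase != "Running":
        return False
    for cond in (pod.status.conditions or []):
        if cond.type == "Ready" and cond.status == "True":
            return True
    return False
```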
Thank you for the great discussion @jfuechsl and @dmcnaught.
Agreed that it would be nice to look at Pod phases and conditions rather than scheduling events on the k8s API. However, that would require a bit more work. As of right now, this little feature hack saves some headaches on autoscaling clusters and doesn't change the existing behaviour for users, so it is good IMO.
@jfuechsl can you document the new ENV variable in the workflow docs? I added a note on the PR.
@Cryptophobia thanks very much. Yes, I will update the docs and submit a PR.
@Cryptophobia I submitted https://github.com/teamhephy/workflow/pull/108
All relevant PRs for this feature were merged a while ago. Closing now. Thank you @jfuechsl.
We are seeing the following issue during app deployments (git push or deis config:set, etc.) periodically, especially when the app has many pods and the cluster is highly utilized in terms of resource requests.