Open · sluebbert opened this issue 5 months ago
Hi @sluebbert and thanks for raising this issue. I'm wondering whether the autoscaling is just the trigger or the actual cause here, and whether it has uncovered a problem within deployments themselves. It might be hard for us to reproduce, so if you uncover any additional information, please let us know, as it will be very useful. If you have the autoscaler logs from around the time this scaling action was triggered, those would also help us build a reproduction in the future.
It looks like autoscaling is just a trigger. We encountered the same issue with another job this morning, this time caused by allocations migrating off a draining node.
Something just seems out of sync. Here are some screenshots taken now of a "deployment" that started a little over 30 minutes ago.
Overview UI:
Deployments UI:
Versions UI:
About 30 minutes ago we were draining and re-enabling nodes every few minutes for other purposes, which may have helped get us into this situation.
We just found a job that had been stalled like this for 15 days. Over that time, the existing allocations from the previous "version" exited for natural reasons, and we ended up with no running allocations, when the original intent 15 days ago was simply to auto scale down from 3 to 2.
I didn't originally realize this was a risk. At the moment we have 14 other jobs that are stuck pending manual intervention.
This one is bizarre to me as it's just a scaling event from 3 to 2 and looks like it should have finished successfully over a day ago:
It appears we will need to create some type of monitor on our side to watch for these situations and clean them up.
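Something like the sketch below is what I have in mind. It assumes the standard HTTP API at `NOMAD_ADDR`; the 30-minute threshold and the decision to call the deployment fail endpoint are our own policy, not anything Nomad prescribes.

```python
# Minimal sketch of a "stuck deployment" monitor against Nomad's HTTP API.
# Assumption: any deployment that stays in "running" longer than MAX_AGE is
# considered stuck and gets failed so the job can be redeployed.
import os
import time
import requests

NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
MAX_AGE = 30 * 60        # seconds a deployment may stay "running" before we intervene
POLL_INTERVAL = 60       # seconds between polls

first_seen = {}          # deployment ID -> first time we saw it still "running"

while True:
    resp = requests.get(f"{NOMAD_ADDR}/v1/deployments")
    resp.raise_for_status()
    now = time.time()
    running_ids = set()

    for dep in resp.json():
        if dep["Status"] != "running":
            continue
        dep_id = dep["ID"]
        running_ids.add(dep_id)
        first_seen.setdefault(dep_id, now)

        if now - first_seen[dep_id] > MAX_AGE:
            print(f"deployment {dep_id} (job {dep['JobID']}) looks stuck, failing it")
            # Manually fail the deployment, same as `nomad deployment fail <id>`.
            requests.post(f"{NOMAD_ADDR}/v1/deployment/fail/{dep_id}").raise_for_status()

    # Forget deployments that are no longer running.
    first_seen = {k: v for k, v in first_seen.items() if k in running_ids}
    time.sleep(POLL_INTERVAL)
```

A real version would also need to pass the `namespace` query parameter if you use namespaces; this only sees the request's default namespace.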
Nomad version
1.7.3
Operating system and Environment details
Ubuntu 22.04.2 LTS
Issue
From time to time, auto scaling results in a deployment that partially fails. The deployment never completes and gets stuck in a state that seems corrupt.
In the example below, the autoscaler is dropping the instance count from 5 to 2. We came across this instance an hour after it started. Given that the deployment is auto promoting and auto reverting, and that there is a tight progress deadline of 31 seconds, I don't understand why the deployment is still running and stuck hours later. The deployment starts at 08:54:13, but the events and deadlines are smeared across roughly 4 minutes, which also doesn't make sense to me; I'm not sure what it was doing during that time.
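For reference, this is roughly how I'm pulling the deployment state shown below. The deployment ID is a placeholder, and the per-group fields are the ones the deployment read API returns under `TaskGroups` (at least on 1.7), which is where the canary, promotion, and progress-deadline state lives.

```python
# Quick look at a single deployment's canary/auto-promote state via the HTTP API.
# The deployment ID is a placeholder; field names are as returned by
# GET /v1/deployment/:deployment_id in the TaskGroups map.
import os
import requests

NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

dep = requests.get(f"{NOMAD_ADDR}/v1/deployment/{DEPLOYMENT_ID}").json()
print(dep["Status"], "-", dep["StatusDescription"])

for group, state in dep["TaskGroups"].items():
    print(
        f"{group}: desired={state['DesiredTotal']} "
        f"placed={state['PlacedAllocs']} healthy={state['HealthyAllocs']} "
        f"canaries={state['PlacedCanaries']} promoted={state['Promoted']} "
        f"progress deadline passes at {state['RequireProgressBy']}"
    )
```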
The UI overview tab shows a summary that does not include canary information:
The UI deployment tab does show canary information:
The chart showing auto scaling behavior:
The change recorded for the deployment:
A request to the API to get the deployment returns:
A request to the API to get the canary allocation returns:
Consul shows all 3 allocations are healthy.
Reproduction steps
We have hundreds of successful auto scaling deployments every day across many services. I haven't been able to recreate this on demand.
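The closest I can get to a repro recipe is issuing the same kind of scale-down the autoscaler would, while a node hosting one of the group's allocations is draining. A rough sketch; the job and group names are placeholders:

```python
# Rough reproduction attempt: issue the scale-down the autoscaler would send,
# while a node running one of the group's allocations is being drained
# out-of-band (e.g. `nomad node drain -enable <node-id>`).
import os
import requests

NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
JOB_ID = "example-service"   # placeholder
GROUP = "web"                # placeholder

payload = {
    "Count": 2,
    "Target": {"Group": GROUP},
    "Message": "manual scale-down to mimic the autoscaler",
}
resp = requests.post(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/scale", json=payload)
resp.raise_for_status()
print(resp.json())
```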
Expected Result
Actual Result
Nomad Client logs
Consul logs for canary alloc:
Nomad logs don't turn up anything interesting on that alloc.