hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Support canaries for task migrations #5916

Open eigengrau opened 5 years ago

eigengrau commented 5 years ago

It would be convenient if migrations (e.g., triggered by nomad node drain and governed by the migrate stanza) were able to create canary allocations. These allocations would be created before old allocations are destroyed, and old allocations would not be stopped until replacement allocations are deemed healthy. Unlike canaries, these replacements would probably not require manual promotion, but would instead promote automatically when healthy.

This feature would be useful to us since we run some workloads for staging or development purposes with count = 1. While these jobs do not require high-availability, it would nonetheless be preferable if services were not disrupted by a node drain operation. Since some of the applications we run require a 30-40 minute initialization time (I know, right) before service becomes available, the impact of a node drain in these cases can become a nuisance.
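For context, the `migrate` stanza as it exists today only controls how many allocations are drained in parallel and when a replacement counts as healthy; it has no canary-style overlap between old and new allocations. A sketch of the current parameters:

```hcl
migrate {
  # Number of allocations migrated at the same time.
  max_parallel = 1

  # How health is determined: "checks" (Consul health checks)
  # or "task_states" (task running state only).
  health_check = "checks"

  # How long a replacement must be healthy before migration continues.
  min_healthy_time = "10s"

  # Deadline for a replacement to become healthy before it is
  # considered failed.
  healthy_deadline = "5m"
}
```

Note that all of these govern the *replacement* allocations; the old allocation is still stopped before its replacement starts, which is exactly the gap this issue describes.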

mildred commented 5 years ago

The same use case could be implemented as a rolling update: instead of removing an allocation before starting a new one, it would start a new allocation, wait for its health check to pass, and only then stop an old allocation, repeating until all old allocations have been replaced.
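For comparison, this start-before-stop behavior already exists for *version* updates via the `update` stanza's `canary` and `auto_promote` settings; the request in this issue is essentially for node drains to honor an equivalent overlap. A hedged sketch of the existing update-time mechanism:

```hcl
update {
  # Start one canary allocation alongside the old one.
  canary = 1

  # Promote the canary automatically once it is healthy,
  # instead of requiring a manual "nomad deployment promote".
  auto_promote = true

  # Health is judged by Consul service checks.
  health_check     = "checks"
  min_healthy_time = "10s"
}
```

With this configuration a deployment keeps the old allocation running until the canary passes its checks, which is the behavior the commenters above would like migrations to share.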

apollo13 commented 4 years ago

> This feature would be useful to us since we run some workloads for staging or development purposes with count = 1. While these jobs do not require high-availability, it would nonetheless be preferable if services were not disrupted by a node drain operation.

Same here, i.e. we can deal with downtime in case of a crash, but would like to be able to do maintenance operations without downtime. Note that the services in question could all be run HA, but there is generally no need to waste the CPU/RAM by setting count = 2.

Himura2la commented 2 years ago

I was ready to file the same issue, but fortunately I found this one. This is an absolute must-have feature for draining nodes that run single-instance jobs. I don't see any reason why it is not already implemented: since we already have the ability to update jobs seamlessly, a migration looks no different.

ivantopo commented 7 months ago

Hello people! We just got hit by this same behavior. We have jobs that generally should only have 1 allocation running at any given time, although we support having 2 of them running for short periods of time while doing canary updates.

Same as mentioned before, it would be very nice if migration worked like a canary update with auto promotion to avoid downtime.

Is there anything we can do to help move this forward?