**Open** · roman-vynar opened this issue 2 years ago
Hi @roman-vynar, thanks for the suggestion. I think it would be a nice enhancement to have.
But is there anything specific from this issue that is not in #8538?
@lgfa29 yes, it's a bit different, as it's not about a service/task update.
When you run a service with count=1, you can get into a situation where no allocs are running at all (because the drain stops the alloc immediately), and there is no guarantee a replacement will come up soon if there are other problems with rescheduling, e.g. a lack of resources. I think when a node is set to drain it should re-allocate the existing allocs first and only then kill them.
It's not only about count=1 services. The current draining functionality kills all allocs running on the node at once, and you may end up in a bad spot when you can't get them back on the new nodes. Maybe we need a new feature: tell an alloc to re-allocate itself, and that is what the drain should do?
Any suggestions are welcome. Thanks!
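For context, the pacing of drain migrations can already be tuned per group with the job's `migrate` block, though it only throttles how many allocs are stopped at once; it does not guarantee a replacement is healthy before the old alloc is killed. A minimal sketch (values are illustrative):

```hcl
job "my-service" {
  group "app" {
    count = 1

    # Controls how allocations are migrated off a draining node.
    # max_parallel limits how many allocs migrate at a time, but the
    # old alloc is still stopped before the new one is placed.
    migrate {
      max_parallel     = 1
      health_check     = "checks"
      min_healthy_time = "10s"
      healthy_deadline = "5m"
    }
  }
}
```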
Got it, thanks for extra details @rhuddleston and @roman-vynar!
> I think when a node is set to drain it should re-allocate the existing allocs first and then kill them.
I can see some scenarios where you may not want this to happen. For example, for services that use a single-writer volume, you can't start the new alloc before stopping the old one.
So I think this may need to be an opt-in behaviour, either at drain time or per group? 🤔
I set this for our community triage process so we can discuss it a bit more.
Thanks for the idea!
Dropping a note here for us to consider the impact of any changes we make here on `ephemeral_disk.migrate`.
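For reference, the behaviour in question is configured per task group; a minimal sketch (values are illustrative):

```hcl
group "app" {
  # When migrate = true, Nomad makes a best-effort attempt to move
  # the ephemeral disk's data to the replacement allocation. A
  # start-replacement-first drain would change when that copy can happen.
  ephemeral_disk {
    migrate = true
    sticky  = true
    size    = 300 # MB
  }
}
```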
Hello,
Let's say we have a service with count=1. When I set the node to drain, Nomad stops the allocs on that node immediately, so my single service alloc is gone. Only afterwards does Nomad start re-allocating it (the migrate phase).
I would expect it to bring up a new alloc on another node first and only then kill the old one, rather than leaving us at count=0. In fact, it does not matter whether count is 1 or not: the point is that the effective count drops below the desired count during "migration".
Related https://github.com/hashicorp/nomad/issues/8538
As I understand it, this all comes down to oversubscription. I think it would be nice to have an option that guarantees the desired count of allocs during the "update" and/or "migrate" (drain) phases. Lastly, can you suggest how this could be handled without manually doing +1 before a service update or node drain and then -1 once we are done, so that the guaranteed count of allocs is satisfied? That's really important.
Thanks!
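The manual +1/-1 workaround mentioned above can be sketched with the CLI (the job name and node ID are placeholders; `nomad job scale` requires Nomad 0.11+):

```shell
# Temporarily raise the count so a replacement exists elsewhere...
nomad job scale my-service 2

# ...drain the node; the alloc on it is stopped, but one alloc survives.
nomad node drain -enable -deadline 30m <node-id>

# Restore the desired count once the drain completes.
nomad job scale my-service 1
```

This keeps at least one alloc running throughout the drain, at the cost of manual orchestration around every drain, which is exactly what this issue asks Nomad to do automatically.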