Open valodzka opened 1 year ago
Hi @valodzka! This is a really cool idea (and I like the whimsy of the suggested name :grinning: ).
There are a couple of interesting challenges here. The scheduler evaluates the entire job so that it can determine whether to remove or update in place allocations that are already running. So not every exhausted dimension makes sense if you try to check a single node. For example, `spread` blocks or `distinct_hosts` constraints can only be determined with respect to all the other allocations. And we'd probably have to extract the logic for feasibility checking a single node from the rest of the scheduler.
So given those two problems, maybe the right approach here would be to run a full plan (just like we normally do, similar to `nomad job plan`), but then extract more detailed information about why nodes were rejected, so we can filter down to one node and report back to the user. Or we could add that data to the normal `nomad job plan` output and just make it a lot more verbose.
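The "run a full plan, then filter down to one node" idea can be sketched with a toy model. This is not Nomad's scheduler: `Node`, `Group`, `check`, `plan`, and `why_not` are all hypothetical names, and placement is simplified to "first feasible node wins". It only illustrates why a `distinct_hosts` rejection can be reported for a single node only after the whole group has been evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:  # hypothetical, simplified node
    name: str
    datacenter: str
    free_cpu: int      # MHz
    free_memory: int   # MB


@dataclass(frozen=True)
class Group:  # hypothetical, simplified task group
    name: str
    count: int
    cpu: int
    memory: int
    datacenters: tuple
    distinct_hosts: bool = False


def check(group, node, free, placed):
    """Return the reasons this node cannot take one more allocation."""
    reasons = []
    cpu, memory = free
    if node.datacenter not in group.datacenters:
        reasons.append(f"datacenter {node.datacenter} not in job datacenters")
    if cpu < group.cpu:
        reasons.append("dimension exhausted: cpu")
    if memory < group.memory:
        reasons.append("dimension exhausted: memory")
    # Depends on earlier placements, so it cannot be checked node-by-node.
    if group.distinct_hosts and node.name in placed:
        reasons.append("distinct_hosts: group already placed on this node")
    return reasons


def plan(group, nodes):
    """Toy full plan: place group.count allocations, recording per-node rejections."""
    placed = []
    rejections = {n.name: [] for n in nodes}
    free = {n.name: (n.free_cpu, n.free_memory) for n in nodes}
    for _ in range(group.count):
        feasible = []
        for n in nodes:
            reasons = check(group, n, free[n.name], placed)
            if not reasons:
                feasible.append(n)
            for r in reasons:
                if r not in rejections[n.name]:
                    rejections[n.name].append(r)
        if not feasible:
            break  # nothing left to try for the remaining allocations
        chosen = feasible[0]
        cpu, memory = free[chosen.name]
        free[chosen.name] = (cpu - group.cpu, memory - group.memory)
        placed.append(chosen.name)
    return placed, rejections


def why_not(group, nodes, node_name):
    """Run the full plan, then report only the reasons for one node."""
    _, rejections = plan(group, nodes)
    return rejections[node_name]


# Made-up inputs, for illustration only.
nodes = [
    Node("a", "dc1", 1000, 1024),
    Node("b", "dc2", 4000, 4096),
    Node("c", "dc1", 200, 1024),
]
group = Group("web", count=2, cpu=500, memory=256,
              datacenters=("dc1",), distinct_hosts=True)
placed, rejections = plan(group, nodes)
```

In this sketch node `a` has enough resources for both allocations, and its rejection only appears in the second round, after the first allocation has landed on it. A check that looked at `a` in isolation would not see that.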
I'll mark this issue for further discussion and roadmapping.
Proposal
Implement a command (separate, or as part of the job plan) that explains why a particular group cannot be placed on a particular node (something distantly similar to `aptitude why`). Possible syntax:
Output:
A possible additional option is something like `-why-not-count 3`, to check why 3 allocations of a group cannot be placed on a particular node.
Use-cases
Periodically I stumble across situations where `nomad plan` shows that some allocation cannot be placed and it's not immediately clear why. The Nomad message is fairly cryptic even in verbose mode (and especially with a lot of nodes):
It would save time to be able to check why a task group cannot be placed on a specific node.
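For context on why the message is hard to act on: the plan response reports failures as per-group aggregate counts, not per-node reasons. The sketch below flattens such a response into readable lines. It assumes the response is a dict with a `FailedTGAllocs` map whose entries carry `NodesEvaluated`, `NodesFiltered`, `NodesExhausted`, `ConstraintFiltered`, `ClassFiltered`, and `DimensionExhausted`, as in Nomad's job plan HTTP API; `summarize_failures` is a hypothetical helper and the sample values are made up.

```python
def summarize_failures(plan_response):
    """Flatten per-group failure metrics from a plan response into text lines."""
    lines = []
    failed = plan_response.get("FailedTGAllocs") or {}
    for group, metric in sorted(failed.items()):
        lines.append(
            f"{group}: {metric.get('NodesEvaluated', 0)} nodes evaluated, "
            f"{metric.get('NodesFiltered', 0)} filtered, "
            f"{metric.get('NodesExhausted', 0)} exhausted"
        )
        # Each of these maps a reason to the number of nodes it ruled out.
        for key, label in (
            ("ConstraintFiltered", "constraint"),
            ("ClassFiltered", "class"),
            ("DimensionExhausted", "dimension"),
        ):
            for name, count in sorted((metric.get(key) or {}).items()):
                lines.append(f"  {label} {name!r}: {count} node(s)")
    return lines


# Made-up sample in the assumed shape, for illustration only.
sample = {
    "FailedTGAllocs": {
        "web": {
            "NodesEvaluated": 3,
            "NodesFiltered": 1,
            "NodesExhausted": 2,
            "ConstraintFiltered": {"${attr.kernel.name} = linux": 1},
            "DimensionExhausted": {"memory": 2},
        }
    }
}
summary = summarize_failures(sample)
```

Even formatted this way, the counts say how many nodes a reason ruled out but not which ones, which is the gap this proposal is about.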
Attempted Solutions
Currently it takes a lot of config checks to understand why a job cannot be scheduled on a particular node. It is doable, but can be quite time-consuming.