hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[Feature] Nomad constraint for consul maint mode #2283

Open jshaw86 opened 7 years ago

jshaw86 commented 7 years ago

We would like there to be a scheduling constraint for when consul is in maintenance mode. A specific nomad client may get into a state where it can accept jobs but consul is in maintenance mode.

The effect is that if routing services like Fabio depend on the consul health checks, the job is never routed to and is considered down.

dadgar commented 7 years ago

I think if we fingerprint this you could add a constraint to avoid nodes in maintenance mode.
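A minimal sketch of the fingerprint-plus-constraint idea, under stated assumptions: the attribute name `consul.maintenance` and the helper names `fingerprint` and `feasible` are hypothetical and not part of Nomad. The sketch relies on Consul registering a node-level check with the ID `_node_maintenance` while node maintenance mode is enabled, so the presence of that key in the agent's check listing (`/v1/agent/checks`) is treated as the signal.

```python
# Check ID Consul registers on the agent while node maintenance mode is on.
NODE_MAINT_CHECK_ID = "_node_maintenance"
# Hypothetical node attribute name; Nomad does not define this attribute.
ATTR = "consul.maintenance"


def fingerprint(agent_checks: dict) -> dict:
    """Turn the agent's checks (keyed by check ID) into a node attribute."""
    in_maint = NODE_MAINT_CHECK_ID in agent_checks
    return {ATTR: "true" if in_maint else "false"}


def feasible(nodes: dict) -> list:
    """Return node names that satisfy the constraint `consul.maintenance != "true"`.

    `nodes` maps node name -> attribute dict as produced by fingerprint().
    """
    return sorted(name for name, attrs in nodes.items() if attrs.get(ATTR) != "true")


# Example: client-a is in Consul maintenance mode, client-b is not.
checks_in_maint = {"_node_maintenance": {"Status": "critical"}}
nodes = {
    "client-a": fingerprint(checks_in_maint),
    "client-b": fingerprint({}),
}
print(feasible(nodes))  # ['client-b']
```

As noted in the thread, a constraint like this would only affect new placements; it would not move existing allocations off the node.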

camerondavison commented 7 years ago

Would that move things off of that node? Seems like this could be taken care of with putting nomad into drain mode if consul is in maintenance mode.

dadgar commented 7 years ago

@a86c6f7964 it wouldn't automatically move them off, but it would if the job was rerun or if, as you said, you put Nomad in drain mode. Seems like the right thing to do.

Actually, by putting Nomad into drain mode you wouldn't need the constraint at all.

jshaw86 commented 7 years ago

@a86c6f7964

"Seems like this could be taken care of with putting nomad into drain mode if consul is in maintenance mode."

You are correct, except what if you bring nomad out of drain but consul remains in maint? The current behavior is that nomad will schedule to the node even with consul in maint, causing all the consul infrastructure to ignore those containers.

Ideally this never happens, but we've found consul to be non-deterministic when bringing it out of maint mode.

camerondavison commented 7 years ago

In theory if there was a config to put nomad into drain when consul is set to maintenance then you would not be able to pull nomad out of drain while consul is in maintenance.

dadgar commented 7 years ago

@jshaw86 The mechanism to block Nomad from scheduling on that node is marking it as drain. I don't think we would want to special-case any other behavior. So maybe this is more of a Consul issue if you are saying it is non-deterministic?

jshaw86 commented 7 years ago

@a86c6f7964 yes, in theory, but not in practice currently :).

jshaw86 commented 7 years ago

@dadgar yeah, maybe so. It would be nice to have something in nomad to prevent this from happening, just to eliminate the potential of getting into this state, but I can try to file an issue with consul.

I know you guys are trying to keep nomad and consul decoupled, so I'm not sure what the best approach would be if you were to do it.

dadgar commented 7 years ago

@jshaw86 I think what makes the most sense is to add an option in the client config so that when Nomad is put into or taken out of drain mode, it toggles the corresponding maintenance mode on Consul.

Let me know what you think.
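A rough sketch of what that toggle could look like, assuming the client talks to a local Consul agent at `http://127.0.0.1:8500`. The function name `maintenance_request` and the reason string are hypothetical; the only real interface used is Consul's agent endpoint `PUT /v1/agent/maintenance` with its `enable` and `reason` query parameters. The request is built but not sent, so the example runs without a Consul agent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def maintenance_request(drain_enabled: bool,
                        consul_addr: str = "http://127.0.0.1:8500") -> Request:
    """Build the Consul request that mirrors a Nomad drain state change.

    drain_enabled=True  -> enable node maintenance mode on the local agent
    drain_enabled=False -> disable it again
    """
    params = {"enable": "true" if drain_enabled else "false"}
    if drain_enabled:
        # Free-form text shown by Consul next to the maintenance check.
        params["reason"] = "nomad drain"
    url = f"{consul_addr}/v1/agent/maintenance?{urlencode(params)}"
    return Request(url, method="PUT")


req = maintenance_request(True)
print(req.get_method(), req.full_url)
# PUT http://127.0.0.1:8500/v1/agent/maintenance?enable=true&reason=nomad+drain
```

Passing the returned object to `urllib.request.urlopen` would actually perform the toggle. Note that this only covers the drain-to-maintenance direction; it does nothing when Consul maintenance is enabled by some other path.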

pierreca commented 2 years ago

Nomad drain toggling maintenance mode in consul would be good but not sufficient, as consul maintenance can be enabled without going through nomad drain.

Fingerprinting is an interesting route, but you could also consider registering a node health check with consul and, when that health check fails because of maintenance mode, maybe draining the node.

Because consul can also be used for service discovery, and services aren't advertised in maintenance mode, it seems reasonable for nomad to consider this allocation as failed or lost (or maybe another state, like "unusable") and reschedule it on another node.

Continuing to schedule, or considering an allocation healthy, when maintenance mode is enabled on that node (and therefore its health checks are failing) leads to broken or undiscoverable service instances.
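One way to picture the reverse direction (Consul maintenance driving Nomad's behavior) is a small reconcile step, sketched here under assumptions: `NodeState`, `reconcile`, and the action names are hypothetical, and the maintenance signal is again taken to be the presence of Consul's `_node_maintenance` check in the agent's check listing. It also folds in the guard suggested earlier in the thread, that a node should not leave drain while Consul is still in maintenance.

```python
from dataclasses import dataclass, field

# Check ID Consul registers on the agent while node maintenance mode is on.
NODE_MAINT_CHECK_ID = "_node_maintenance"


@dataclass
class NodeState:
    draining: bool                                      # current Nomad drain state
    consul_checks: dict = field(default_factory=dict)   # agent checks keyed by check ID


def reconcile(state: NodeState) -> str:
    """Decide what to do with a node given its Consul maintenance state.

    Returns one of (hypothetical action names):
      "drain"      - Consul is in maintenance but Nomad is still scheduling here
      "hold-drain" - already draining; do not undrain while maintenance is on
      "none"       - Consul is not in maintenance; leave the node alone
    """
    in_maintenance = NODE_MAINT_CHECK_ID in state.consul_checks
    if in_maintenance and not state.draining:
        return "drain"
    if in_maintenance and state.draining:
        return "hold-drain"
    return "none"


maint_checks = {NODE_MAINT_CHECK_ID: {"Status": "critical"}}
print(reconcile(NodeState(draining=False, consul_checks=maint_checks)))  # drain
print(reconcile(NodeState(draining=True, consul_checks=maint_checks)))   # hold-drain
print(reconcile(NodeState(draining=False)))                              # none
```

Whether the right reaction is a drain, or marking existing allocations as failed or lost so they are rescheduled, is exactly the open question in this comment; the sketch only shows where such a decision would hook in.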