hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Constraints violated when allocation failed + rescheduled? #11199

Open mrkurt opened 3 years ago

mrkurt commented 3 years ago

Nomad version

Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)

Operating system and Environment details

Ubuntu 18.04

Issue

We have a job with these constraints/affinities:

{"Constraints"=>
  [{"LTarget"=>"${meta.fly_region}",
    "Operand"=>"distinct_property",
    "RTarget"=>"1"},
   {"LTarget"=>"${meta.fly_region}",
    "Operand"=>"set_contains_any",
    "RTarget"=>"atl,cdg,gru,hkg,iad,lhr,maa,nrt,sjc,syd"}],
 "Spreads"=>
  [{"Weight"=>100, "Attribute"=>"${meta.fly_region}", "SpreadTarget"=>nil}],
 "Affinities"=>
  [{"Weight"=>50,
    "LTarget"=>"${meta.fly_region}",
    "Operand"=>"set_contains_any",
    "RTarget"=>"atl,cdg,gru,hkg,iad,lhr,maa,nrt,sjc,syd"}]}

When the job is created or updated, it does what you'd expect (one allocation per fly_region). It behaves strangely when allocations fail, though: it violates the distinct_property constraint when it reschedules them. Two allocations failed overnight, and the replacements left us with two allocations running in each of fly_region == 'atl' and fly_region == 'maa'.
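To make the expected invariant concrete, here is a minimal sketch of a check for it. It is not Nomad's scheduler code. The allocation records, their field names, and the helper `distinct_property_violations` are hypothetical and only model the situation described above, assuming the `RTarget` of `"1"` means at most one running allocation per `fly_region` value.

```python
from collections import Counter


def distinct_property_violations(allocs, limit=1):
    """Return {region: count} for regions with more than `limit` running allocs.

    `allocs` is a list of dicts with hypothetical keys "fly_region" and
    "status"; this is an illustration, not a Nomad API structure.
    """
    counts = Counter(
        a["fly_region"] for a in allocs if a["status"] == "running"
    )
    return {region: n for region, n in counts.items() if n > limit}


# Hypothetical snapshot resembling the state after the overnight reschedules.
allocs = [
    {"id": "a1", "fly_region": "atl", "status": "running"},
    {"id": "a2", "fly_region": "atl", "status": "running"},
    {"id": "a3", "fly_region": "maa", "status": "running"},
    {"id": "a4", "fly_region": "maa", "status": "running"},
    {"id": "a5", "fly_region": "cdg", "status": "running"},
    {"id": "a6", "fly_region": "iad", "status": "failed"},
]

violations = distinct_property_violations(allocs)
print(violations)  # {'atl': 2, 'maa': 2}
```

With a limit of 1, any non-empty result means the running allocations no longer satisfy the constraint. Failed allocations are ignored because only running ones count toward the limit in this sketch.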

Reproduction steps

I can't reproduce this on purpose but it's happened at least twice with the same job spec.

Is this actually a bug or am I misunderstanding how this is supposed to work?

lgfa29 commented 3 years ago

Hi @mrkurt, thanks for the report.

I think I was able to reproduce the issue, though I don't know if it represents your situation well. It's kind of hard to explain, so I recorded a video. Here are the files I used as well.

Do you think this is a fair reproduction of your scenario?

We will need more time to dig deeper into this, but at first glance it seems like Nomad is not reevaluating constraints on alloc reschedules.

mrkurt commented 3 years ago

Oh clever! That looks like what we're seeing, yeah. I'm incredibly relieved you were able to reproduce. 😄

lgfa29 commented 3 years ago

> Oh clever! That looks like what we're seeing, yeah. I'm incredibly relieved you were able to reproduce. 😄

It took me a while to find a consistent way to reproduce it, but now it should be a bit easier to debug it.

Your detailed instructions also helped a lot too, so thank you 🙂