Open mrkurt opened 3 years ago
Hi @mrkurt, thanks for the report.
I think I was able to reproduce the issue, though I don't know if it represents your situation well. It's kind of hard to explain, so I recorded a video. Here are the files I used as well.
Do you think this is a fair reproduction of your scenario?
We will need more time to dig deeper into this, but at a first look it seems like Nomad is not reevaluating constraints on alloc reschedules.
Oh clever! That looks like what we're seeing, yeah. I'm incredibly relieved you were able to reproduce. 😄
It took me a while to find a consistent way to reproduce it, but now it should be a bit easier to debug it.
Your detailed instructions also helped a lot too, so thank you 🙂
Nomad version
Nomad v1.1.0 (2678c3604bc9530014208bc167415e167fd440fc)
Operating system and Environment details
Ubuntu 18.04
Issue
We have a job with these constraints/affinities:
When the job is created or updated, it does what you'd expect (one allocation per fly_region). It's behaving weirdly when allocations fail, though, and violating the distinct_property constraint when it reschedules them. Two allocations failed overnight and the replacements ended up creating a situation where 2x allocs were running on clients with fly_region == 'atl' and fly_region == 'maa'.
Reproduction steps
I can't reproduce this on purpose but it's happened at least twice with the same job spec.
Is this actually a bug or am I misunderstanding how this is supposed to work?
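For anyone trying to spot the same state in their own cluster, here is a rough sketch of the invariant I'd expect to hold: at most one running allocation per fly_region value. This is not how Nomad evaluates the constraint internally. The function name, the allocation IDs, and the 'ord' region are all made up for illustration; only 'atl' and 'maa' come from the situation described above.

```python
from collections import Counter


def distinct_property_violations(alloc_regions, limit=1):
    """Return the property values shared by more than `limit` allocations.

    `alloc_regions` maps an allocation ID to the fly_region of the client
    it runs on. This is a hypothetical helper, not part of any Nomad API.
    """
    counts = Counter(alloc_regions.values())
    return {region: n for region, n in counts.items() if n > limit}


# Hypothetical allocation IDs; 'atl' and 'maa' are doubled up as in the report.
allocs = {
    "alloc-1": "atl",
    "alloc-2": "atl",
    "alloc-3": "maa",
    "alloc-4": "maa",
    "alloc-5": "ord",  # made-up region holding a single allocation
}

violations = distinct_property_violations(allocs)
print(violations)
```

An empty result would mean the placement matches what distinct_property is supposed to guarantee. Any entry in the result is a region holding more allocations than the limit allows.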