hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

blue/green deployments with spread result in unbalanced workload #9183

Open hvindin opened 3 years ago

hvindin commented 3 years ago

If filing a bug please include the following:

Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)

Operating system and Environment details

NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
VARIANT="Server"
VERSION_ID="7.7"

Issue

When trying to achieve an ideal even spread of job allocations between two Consul DCs, we specify

     "Spreads": [
        {
          "Attribute": "${attr.consul.datacenter}",
          "Weight": 50,
          "SpreadTarget": null
        }
      ],

in all of our job definitions. There are two Consul datacenters with exactly the same number of workload servers, which should in theory mean that we always end up with 50% of allocations in one datacenter and 50% in the other.
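For reference, the same spread expressed in HCL jobspec form would be roughly the sketch below; with no target blocks declared, Nomad spreads evenly across all values of the attribute:

    spread {
      attribute = "${attr.consul.datacenter}"
      weight    = 50
      # no targets declared, so allocations should spread evenly
      # across both Consul datacenters
    }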

If something goes wrong, for example an ESXi host underlying the VMs running Nomad dies and kills off a large number of workload servers in one site, or any other major disruption causes Nomad to start allocating jobs unevenly across Consul datacenters (which is desirable during the outage event), it becomes impossible to rebalance the jobs by deploying a new blue/green declared job.

This appears to be because, as indicated by the documentation, we are specifying the same number of canaries as the eventual desired count for each TaskGroup.

So, taking a simple example of a job with 1 TaskGroup declaring a Count of 4 and the above Spread, we have seen the following sequence of events:

  1. A major disruption occurs in one site and jobs migrate to the available compute capacity in the still-healthy site. The job we're following might end up with 4 allocs in one site and 0 allocs in the site that suffered the outage. (For the duration of the outage this is totally fine and actually desirable.)
  2. The disruption is fixed and compute capacity is restored. Everyone is happy. There are still 4 allocs running in one site and 0 allocs in the now-healthy site, so obviously something needs to be done to get the job back to "roughly evenly distributed", to avoid the risk of a similar outage event hitting the site with all 4 allocs deployed and causing a total outage while the jobs are moved over to the healthy site again.
  3. To work around this, one would assume that deploying the job as a regular blue/green deployment, as we have set up elsewhere, would result in the new allocations being distributed 2/2 across the datacenters.

However, what we have observed is that when we schedule a job with 4 canaries, to match the job's count of 4, all 4 canaries are spun up in the Consul datacenter with 0 allocs (which, for that specific moment in time, does technically result in a 50/50 split across the datacenters). Then, when the canaries become healthy and are promoted, the 4 allocs that were there previously spin down, leaving us, once again, with 4 allocations in one datacenter and 0 in the other.

The only way we have been able to work around this is by lowering the canary count on the job so that fewer canaries are deployed and the job has a chance of coming up evenly.

This seems like it's possibly not the intent of the Spread functionality. I would have expected the Spread declared in a job to describe the desired state after the canaries are promoted and the old allocations are spun down, rather than the spread at the moment the canaries are spun up, which becomes somewhat meaningless once the old allocations are removed.

It's entirely possible that I'm simply misunderstanding the Spread functionality and the documented approach to blue/green deployments. If so, a pointer to the docs I've likely misread would be much appreciated, but this behaviour does strike me as unexpected given what I've read about the Spread definition.

Reproduction steps

See above description of how the state was achieved.

tgross commented 3 years ago

Hi @hvindin! Thanks for opening this issue and describing the scenario in such detail.

This is definitely a tricky scenario where we've got two features that are interacting in unexpected ways. Your understanding of how spread should work is correct. My hypothesis is that the scheduler is calculating the spread across both versions of the job, which probably looks correct and passes most testing when there's only 1 canary (or a small number relative to the total count), but obviously that results in really bad behavior when you're trying to do blue/green.

So this looks like a bug to me and we'll need to investigate further. I don't have an immediate workaround for you other than temporarily dropping the canaries down to 1 when you run the job to recover from the failover. Once you've done that you can re-register the job with canaries == count without causing a new deployment (because that's a non-destructive change).
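As a rough sketch of how that workaround maps onto the jobspec (the count of 4 comes from the example above; the max_parallel value and everything else in the update stanza are assumed unchanged):

    # Rebalancing run after the failover: temporarily drop to a single canary
    update {
      canary       = 1
      max_parallel = 1
    }

    # Afterwards, re-register with canaries == count (4 in the example above).
    # Only the update stanza changes, so this is a non-destructive change and
    # does not trigger a new deployment.
    update {
      canary       = 4
      max_parallel = 1
    }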

I've re-titled the issue just for clarity in getting this looked into.

hvindin commented 3 years ago

Thanks @tgross

That is essentially what we've ended up doing. Knowing that the version of the job is essentially the same, and hence a blue/green deployment isn't really required because we aren't changing versions, we've made sure that our operators, who would normally just hit the run button to deploy a job, are aware that if they see a job that's unbalanced between sites during regular operations they need to redeploy essentially the same job, but with the canary count set to 2 and max_parallel in the update stanza set to 1. They could also do the maths to work out what values would get things balanced more quickly, but that's the only combination I could come up with reasonably quickly that works for jobs with any count from 4 (our minimum) up to any maximum desired count.
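In jobspec terms, that operator recipe is roughly the following sketch (only the update stanza differs from the normal job definition):

    # Rebalancing redeploy used when a job is unbalanced between sites;
    # per the comment above, works for any group count from 4 upwards
    update {
      canary       = 2
      max_parallel = 1
    }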

So it's a fixable problem in terms of cluster state, but we have found it problematic that a job can sit in a very unbalanced state for a while: everyone who deploys jobs assumed that simply rerunning the job would fix it, when in reality rerunning a job with canary count == desired count will always end up with an unbalanced result if that is the job's current state.

alitvak69 commented 3 years ago

I just discovered that the same thing happens even without using the spread feature. I am running Nomad 0.11.1 and have a job with 3 tasks running in a single DC (see configuration below). If I keep the canary number at 1 or 2, all tasks remain evenly spread across nodes after the update finishes. As soon as I change the number of canaries to 3, which is equal to the number of originally running tasks, the distribution becomes uneven: two out of three canaries start on the same node, and after promotion the job ends up using only 2 of the 3 available nodes. Playing with max_parallel makes no difference.

job "smanager-pbxsm-featuretest-staging" { datacenters = [ "chi-pbx"]

type = "service"

meta { required_environment = "staging" required_platform = "pbxsm-featuretest" } constraint { attribute = "${meta.environments}" operator = "set_contains" value = "staging" } update { canary = 2 max_parallel = 3 min_healthy_time = "30s" healthy_deadline = "5m" auto_revert = true auto_promote = false }

group "smanager" { count = 3

FlorianNeacsu commented 2 weeks ago

I encountered this limitation and, as a workaround, I'm adjusting the spread target percentages to ensure Nomad schedules the canaries across all targets (in my case, AWS AZs). For example, if the current allocation spread is 2, 1, and 0 across AZs a, b, and c respectively, then after deploying the canaries (the same number as the total count, 3) the spread will be 3, 2, and 1 (16%, 32%, 48%). Once the canaries are promoted, the allocations balance out to 1, 1, 1. The caveat is that after the job becomes balanced, the adjusted spread percentages need to be reverted, which requires a second deployment, but this time the job maintains the balanced state.
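For illustration, such an adjusted spread stanza might look roughly like the sketch below; the attribute is Nomad's AWS placement attribute, while the target names and percentages are assumptions that would be tuned to match the current imbalance:

    spread {
      attribute = "${attr.platform.aws.placement.availability-zone}"
      weight    = 100

      # Bias canary placement toward the AZs that currently hold fewer allocs
      target "eu-west-1a" {
        percent = 16
      }

      target "eu-west-1b" {
        percent = 32
      }

      target "eu-west-1c" {
        percent = 48
      }
    }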