hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.
Mozilla Public License 2.0

Scaling down "gce-mig" drains node with running batch job #571

Open · vincenthuynh opened this issue 2 years ago

vincenthuynh commented 2 years ago

When scaling down, the autoscaler picks a node with a long-running batch job to drain, despite using the empty_ignore_system strategy:

target "gce-mig" {
  project                = "${var.project}"
  region                 = "${var.region}"
  mig_name               = "${var.mig_name}"
  datacenter             = "${var.autoscale_dc}"
  node_drain_deadline    = "30m"
  node_selector_strategy = "empty_ignore_system"
}

What we end up seeing is that any long-running alloc that exceeds the drain deadline gets killed (screenshot attached).
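For context on the kill behavior: a drain with a deadline tells Nomad to stop any allocations still running once the deadline expires, so that part is standard drain behavior rather than something the selector strategy controls. Below is a minimal sketch of the kind of drain call involved, using the Nomad Go API (github.com/hashicorp/nomad/api); it is illustrative only, not the autoscaler's actual code, and the node ID is a placeholder:

package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Client configured from the usual NOMAD_ADDR / NOMAD_TOKEN env vars.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Drain spec mirroring the target config above: a 30m deadline.
	// IgnoreSystemJobs=true is an assumption for this sketch; allocations
	// still running when the deadline expires are stopped by Nomad.
	spec := &api.DrainSpec{
		Deadline:         30 * time.Minute,
		IgnoreSystemJobs: true,
	}

	nodeID := "REPLACE-WITH-NODE-ID" // placeholder
	resp, err := client.Nodes().UpdateDrain(nodeID, spec, false, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("drain enabled, node modify index:", resp.NodeModifyIndex)
}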

Sample batch job:

job "batch" {
  datacenters = ["dispatch"]
  type        = "batch"
  parameterized {
    meta_optional = ["sleep", "splay"]
  }
  meta {
    sleep = "180"
    splay = "60"
  }
  group "batch" {
    task "sleep" {
      driver = "docker"
      config {
        image   = "alpine:3.13"
        command = "/bin/ash"
        args    = ["${NOMAD_TASK_DIR}/sleep.sh"]
      }
      template {
        data = <<EOF
#!/usr/bin/env bash
dur=$((${NOMAD_META_sleep} + RANDOM % ${NOMAD_META_splay}))
echo "Sleeping for ${dur}s"
sleep $dur
echo "Done"
        EOF
        destination = "local/sleep.sh"
      }
      resources {
        cpu    = 100
        memory = 10000
      }
    }
  }
}

Policy:

scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 5

  policy {
    cooldown            = "5m"
    evaluation_interval = "10s"

    check "batch_jobs_in_progess" {
      source = "prometheus"
      query  = "sum(nomad_nomad_job_summary_queued{exported_job=~\".*dispatch.*\"} + nomad_nomad_job_summary_running{exported_job=~\".*dispatch.*\"}) OR on() vector(0)"
      strategy "pass-through" {}
    }

    target "gce-mig" {
      project                = "${var.project}"
      region                 = "${var.region}"
      mig_name               = "${var.mig_name}"
      datacenter             = "${var.autoscale_dc}"
      node_drain_deadline    = "30m"
      node_selector_strategy = "empty_ignore_system"
    }
  }
}
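For reference, the pass-through strategy means the query result is used directly as the desired node count, which the autoscaler then caps to the policy's min/max. A minimal sketch of that arithmetic (my own illustration, not the autoscaler's code), which matches the "capped count from 0 to 1 to stay within limits" lines in the logs later in this thread:

package main

import "fmt"

// desiredCount applies the pass-through idea: the metric value becomes the
// target count, clamped to the policy's min/max bounds.
func desiredCount(metric float64, minCount, maxCount int64) int64 {
	count := int64(metric)
	if count < minCount {
		return minCount
	}
	if count > maxCount {
		return maxCount
	}
	return count
}

func main() {
	// With min=1 and max=5: a metric of 0 is capped to 1.
	fmt.Println(desiredCount(0, 1, 5)) // 1
	fmt.Println(desiredCount(3, 1, 5)) // 3
	fmt.Println(desiredCount(9, 1, 5)) // 5
}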

Autoscaler: v0.3.4, Nomad: v1.1.6

lgfa29 commented 2 years ago

Hi @vincenthuynh 👋

Yeah, empty_ignore_system won't help you here since it only applies to jobs with type system.

I think it would make sense to skip nodes that have active batch jobs, since they are often time-consuming to restart. Something like avoid_active_batch.

The problem is that we currently only allow a single node_selector_strategy, and a strategy like this wouldn't really be mutually exclusive with the existing options, so we'll have to come up with a way to define multiple strategies and decide how they would interact.
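To make the idea concrete, a hypothetical avoid_active_batch strategy would boil down to a predicate like the sketch below, built on the Nomad Go API (github.com/hashicorp/nomad/api). The name and shape are purely illustrative; nothing like this exists in the autoscaler, and the node ID is a placeholder:

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// hasActiveBatchAlloc reports whether a node is running any allocation that
// belongs to a batch job. A hypothetical "avoid_active_batch" strategy could
// skip such nodes when picking scale-in candidates.
func hasActiveBatchAlloc(client *api.Client, nodeID string) (bool, error) {
	allocs, _, err := client.Nodes().Allocations(nodeID, nil)
	if err != nil {
		return false, err
	}
	for _, alloc := range allocs {
		if alloc.ClientStatus != api.AllocClientStatusRunning {
			continue
		}
		if alloc.Job != nil && alloc.Job.Type != nil && *alloc.Job.Type == api.JobTypeBatch {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	busy, err := hasActiveBatchAlloc(client, "REPLACE-WITH-NODE-ID") // placeholder
	if err != nil {
		panic(err)
	}
	fmt.Println("node has active batch alloc:", busy)
}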

vincenthuynh commented 2 years ago

Hi @lgfa29, thanks for the reply and the insight!

Perhaps I'm misunderstanding the docs, but I read empty_ignore_system as an extension of empty, so shouldn't it also apply to batch workloads?

I'm not sure if this is related, but at one point we ended up with a single node that was set to draining and ineligible, preventing any new jobs from being scheduled.

lgfa29 commented 2 years ago

Oh sorry, I totally misunderstood your original message 😅

So yeah, empty_ignore_system is like empty in the sense that it will only pick nodes that have no allocations running, but it does not count system jobs, since those are always running on all clients and would otherwise prevent any client from being picked.

Batch jobs are not part of this exception, so a client with a running batch job should not have been picked. Do you happen to have some logs from the Autoscaler around the time the scale-down happened?
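While gathering logs, it can also help to dump what the selector would actually see. The sketch below lists, per node, how many running allocations are system jobs versus anything else, using the Nomad Go API; under the empty_ignore_system rule described above, only nodes whose non-system count is zero should be drain candidates. Again, this is a diagnostic illustration, not autoscaler code:

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		panic(err)
	}

	for _, n := range nodes {
		allocs, _, err := client.Nodes().Allocations(n.ID, nil)
		if err != nil {
			panic(err)
		}
		system, other := 0, 0
		for _, alloc := range allocs {
			if alloc.ClientStatus != api.AllocClientStatusRunning {
				continue
			}
			if alloc.Job != nil && alloc.Job.Type != nil && *alloc.Job.Type == api.JobTypeSystem {
				system++
			} else {
				other++
			}
		}
		// Only nodes with other == 0 count as "empty" when system jobs are ignored.
		fmt.Printf("%s (%s): running system allocs=%d, other running allocs=%d\n", n.Name, n.ID, system, other)
	}
}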

vincenthuynh commented 2 years ago

Hi @lgfa29, sorry for the delay. We encountered the issue again today, where we were left with a single draining client after scaling in.

I've sent the complete autoscaler allocation logs over to nomad-oss-debug@hashicorp.com with subject: GH-571: Scaling down "gce-mig" drains node with running batch job. Please let me know if there is another mailbox you might want me to forward it to.

But here's a snippet of the logs; notice the segmentation violation at the end:

...
2022-03-22T18:52:31.011Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:52:41.011Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:52:51.009Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:52:56.441Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T18:53:01.009Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:53:11.008Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:53:21.012Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:53:31.011Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:53:41.010Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:53:51.018Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 2a2ce1ce-891f-64ca-c5b5-f5761ec3fed1 is draining"
2022-03-22T18:54:01.441Z [INFO]  policy_eval.worker: scaling target: id=f3b2b787-29b0-5c8b-817b-13c5b59026fc policy_id=4b2186a6-3c82-e647-34ba-b64edde20b85 queue=cluster target=gce-mig from=3 to=1 reason="capped count from 0 to 1 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:0 nomad_autoscaler.reason_history:[scaling down because metric is 0] nomad_policy_id:4b2186a6-3c82-e647-34ba-b64edde20b85]"
2022-03-22T18:54:01.933Z [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=a99034f7-4d29-9855-511e-4f075f92a692 deadline=2h0m0s
2022-03-22T18:54:01.933Z [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=dfac2421-3e2a-3bce-d1ef-dd6e0510b144 deadline=2h0m0s
2022-03-22T18:54:01.959Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=a99034f7-4d29-9855-511e-4f075f92a692 msg="All allocations on node "a99034f7-4d29-9855-511e-4f075f92a692" have stopped"
2022-03-22T18:54:01.960Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=dfac2421-3e2a-3bce-d1ef-dd6e0510b144 msg="All allocations on node "dfac2421-3e2a-3bce-d1ef-dd6e0510b144" have stopped"
2022-03-22T18:54:01.968Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=dfac2421-3e2a-3bce-d1ef-dd6e0510b144 msg="Drain complete for node dfac2421-3e2a-3bce-d1ef-dd6e0510b144"
2022-03-22T18:54:01.982Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=a99034f7-4d29-9855-511e-4f075f92a692 msg="Drain complete for node a99034f7-4d29-9855-511e-4f075f92a692"
2022-03-22T18:54:02.440Z [INFO]  internal_plugin.gce-mig: successfully deleted GCE MIG instances: action=scale_in instance_group=nmdclient-prodclientdispatch
2022-03-22T18:58:06.453Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T19:03:16.465Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T19:23:56.492Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T19:29:06.502Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
...
...
...
2022-03-22T23:34:02.836Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 50833ea3-ebaf-afc4-4431-1a94b9f4c112 is draining"
2022-03-22T23:34:08.740Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=50833ea3-ebaf-afc4-4431-1a94b9f4c112 msg="Drain complete for node 50833ea3-ebaf-afc4-4431-1a94b9f4c112"
2022-03-22T23:34:09.425Z [INFO]  internal_plugin.gce-mig: successfully deleted GCE MIG instances: action=scale_in instance_group=nmdclient-prodclientdispatch
2022-03-22T23:38:50.894Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T23:44:00.907Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T23:48:33.255Z [INFO]  policy_eval.worker: scaling target: id=0b5026a0-ab87-c16f-7fc8-af39d81317ef policy_id=cbbc6a8d-289a-70cb-0181-133161fc760d queue=cluster target=gce-mig from=1 to=2 reason="scaling up because metric is 2" meta=map[nomad_policy_id:cbbc6a8d-289a-70cb-0181-133161fc760d]
2022-03-22T23:48:53.222Z [INFO]  policy_eval.worker: scaling target: id=01705001-6c67-8113-2b0a-3fb4a079dec9 policy_id=cbbc6a8d-289a-70cb-0181-133161fc760d queue=cluster target=gce-mig from=2 to=5 reason="scaling up because metric is 5" meta=map[nomad_policy_id:cbbc6a8d-289a-70cb-0181-133161fc760d]
2022-03-22T23:49:10.921Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (upstream request timeout)"
2022-03-22T23:52:47.698Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "https://nomad.mydomain.com/v1/scaling/policies?index=3559847&namespace=mynamespace&region=global&wait=300000ms": unexpected EOF"
2022-03-22T23:53:33.041Z [WARN]  policy_eval.broker: eval delivery limit reached: eval_id=b5459b3a-4ed1-43e1-01c3-0a06909428d1 policy_id=cbbc6a8d-289a-70cb-0181-133161fc760d token=4e57887a-a296-142c-6ec3-989acdb9e2e4 count=1 limit=1
2022-03-22T23:54:24.844Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=b5459b3a-4ed1-43e1-01c3-0a06909428d1 eval_token=4e57887a-a296-142c-6ec3-989acdb9e2e4 id=0b5026a0-ab87-c16f-7fc8-af39d81317ef policy_id=cbbc6a8d-289a-70cb-0181-133161fc760d queue=cluster error="evaluation ID not found"
2022-03-22T23:54:34.844Z [INFO]  policy_manager: starting policy source: source=file
2022-03-22T23:54:34.845Z [INFO]  policy_manager: starting policy source: source=nomad
2022-03-22T23:54:35.001Z [INFO]  file_policy_source: starting file policy monitor: file=/local/policies/hashistack.hcl name=cluster_policy policy_id=cbbc6a8d-289a-70cb-0181-133161fc760d
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xa1ae86]

goroutine 4439 [running]:
github.com/hashicorp/nomad-autoscaler/sdk.(*ScalingPolicy).Validate(0xc000016e90)
        /home/circleci/project/project/sdk/policy.go:68 +0x26
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc0003083c0, {0x1a22e30, 0xc000198480}, 0xc00009c7d0)
        /home/circleci/project/project/policy/handler.go:158 +0x61f
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc0000b2810, 0x24})
        /home/circleci/project/project/policy/manager.go:116 +0x46
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
        /home/circleci/project/project/policy/manager.go:115 +0xed3
2022-03-22T23:57:52.761Z [INFO]  agent: Starting Nomad Autoscaler agent
2022-03-22T23:57:52.761Z [INFO]  agent: Nomad Autoscaler agent configuration:
2022-03-22T23:57:52.761Z [INFO]  agent: 
2022-03-22T23:57:52.761Z [INFO]  agent:         Bind Addrs: 0.0.0.0
2022-03-22T23:57:52.761Z [INFO]  agent:          Log Level: INFO
2022-03-22T23:57:52.761Z [INFO]  agent:            Plugins: //plugins
2022-03-22T23:57:52.761Z [INFO]  agent:           Policies: /local/policies
2022-03-22T23:57:52.761Z [INFO]  agent:            Version: v0.3.4 (642b7a7)
...
lgfa29 commented 2 years ago

Thank you for the extra info. I got your logs in that email so we're all good.

I think the problem you're experiencing there was fixed in https://github.com/hashicorp/nomad-autoscaler/pull/539; more specifically, the second point there, where we tweaked the node selection algorithm so it no longer halts when a node is marked as ineligible.
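For anyone following along, the behavior change described there amounts to skipping nodes that are already ineligible or draining instead of aborting the whole selection. A rough illustration of that idea (not the actual code from the linked PR):

package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// pickDrainCandidates skips nodes that are already draining or ineligible and
// keeps scanning the remaining candidates, so a single stuck node no longer
// blocks scale-in. Illustration only; not the implementation in the PR.
func pickDrainCandidates(nodes []*api.NodeListStub, want int) []*api.NodeListStub {
	var picked []*api.NodeListStub
	for _, n := range nodes {
		if len(picked) == want {
			break
		}
		if n.Drain || n.SchedulingEligibility == api.NodeSchedulingIneligible {
			continue // skip instead of halting the selection
		}
		picked = append(picked, n)
	}
	return picked
}

func main() {
	nodes := []*api.NodeListStub{
		{ID: "node-a", Drain: true, SchedulingEligibility: api.NodeSchedulingIneligible},
		{ID: "node-b", SchedulingEligibility: api.NodeSchedulingEligible},
		{ID: "node-c", SchedulingEligibility: api.NodeSchedulingEligible},
	}
	for _, n := range pickDrainCandidates(nodes, 2) {
		fmt.Println("drain candidate:", n.ID)
	}
}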

As for the panic you got, that was fixed in https://github.com/hashicorp/nomad-autoscaler/pull/551, so try running v0.3.5 and see if that works for you.

vincenthuynh commented 2 years ago

Hi @lgfa29

The autoscaler left us in the same situation again, where we have a single ineligible client (screenshot attached).

Here are autoscaler logs during the scale-out/in operations:

2022-05-27T14:01:30.422Z [INFO]  policy_eval.worker: scaling target: id=2ebbe373-f96c-ffa7-d4b8-6f4060dc664b policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster target=gce-mig from=1 to=3 reason="scaling up because metric is 3" meta=map[nomad_policy_id:17bc361d-38ba-ca49-6a30-44a04d02f0dd]
2022-05-27T14:01:50.421Z [INFO]  policy_eval.worker: scaling target: id=a24cd604-fc3c-d7e3-af84-b22d9c24ff73 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster target=gce-mig from=3 to=1 reason="capped count from 0 to 1 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:0 nomad_autoscaler.reason_history:[scaling down because metric is 0] nomad_policy_id:17bc361d-38ba-ca49-6a30-44a04d02f0dd]"
2022-05-27T14:01:50.802Z [WARN]  internal_plugin.gce-mig: can only identify portion of requested nodes for removal: requested=2 available=1
2022-05-27T14:01:50.816Z [WARN]  internal_plugin.gce-mig: can only identify portion of requested nodes for removal: requested=2 available=1
2022-05-27T14:01:50.826Z [INFO]  internal_plugin.gce-mig: identified portion of requested nodes for removal: requested=2 identified=1
2022-05-27T14:01:50.826Z [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=711e0706-0b9b-12b1-32d9-7cd0d65a96d0 deadline=2h0m0s
2022-05-27T14:01:50.854Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=711e0706-0b9b-12b1-32d9-7cd0d65a96d0 msg="All allocations on node "711e0706-0b9b-12b1-32d9-7cd0d65a96d0" have stopped"
2022-05-27T14:04:50.040Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (stream timeout)"
2022-05-27T14:06:50.206Z [WARN]  policy_eval.broker: eval delivery limit reached: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd token=093b0646-45cb-1280-e7a8-f6f071abb74e count=1 limit=1
2022-05-27T14:06:50.846Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c eval_token=093b0646-45cb-1280-e7a8-f6f071abb74e id=a24cd604-fc3c-d7e3-af84-b22d9c24ff73 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster error="failed to scale target: failed to perform scaling action: failed to perform pre-scale Nomad scale in tasks: context done while monitoring node drain: received error while draining node: Error monitoring node: Unexpected response code: 504 (stream timeout)"
2022-05-27T14:06:50.846Z [WARN]  policy_eval.worker: failed to NACK policy evaluation: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c eval_token=093b0646-45cb-1280-e7a8-f6f071abb74e id=a24cd604-fc3c-d7e3-af84-b22d9c24ff73 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster error="evaluation ID not found"
2022-05-27T14:06:51.843Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 711e0706-0b9b-12b1-32d9-7cd0d65a96d0 is draining"
2022-05-27T14:07:00.451Z [INFO]  policy_eval.worker: scaling target: id=b335644b-54b8-70e0-84fc-1e3f89d10878 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster target=gce-mig from=3 to=1 reason="scaling down because metric is 1" meta=map[nomad_policy_id:17bc361d-38ba-ca49-6a30-44a04d02f0dd]
2022-05-27T14:07:00.928Z [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=8c9210b5-c593-a402-3a6b-2f25ad8b1319 deadline=2h0m0s
2022-05-27T14:07:00.929Z [INFO]  internal_plugin.gce-mig: triggering drain on node: node_id=db69b914-f4eb-aef0-dc77-24626c2bd303 deadline=2h0m0s
2022-05-27T14:07:00.955Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=db69b914-f4eb-aef0-dc77-24626c2bd303 msg="All allocations on node "db69b914-f4eb-aef0-dc77-24626c2bd303" have stopped"
2022-05-27T14:07:00.965Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=8c9210b5-c593-a402-3a6b-2f25ad8b1319 msg="All allocations on node "8c9210b5-c593-a402-3a6b-2f25ad8b1319" have stopped"
2022-05-27T14:07:00.974Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=db69b914-f4eb-aef0-dc77-24626c2bd303 msg="Drain complete for node db69b914-f4eb-aef0-dc77-24626c2bd303"
2022-05-27T14:07:10.025Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:07:20.024Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:07:30.032Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:07:40.023Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:07:50.028Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:00.026Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:10.023Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:20.029Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:30.027Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:40.025Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:08:50.030Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:00.029Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:10.023Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:20.027Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:30.023Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:40.025Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 8c9210b5-c593-a402-3a6b-2f25ad8b1319 is draining"
2022-05-27T14:09:44.195Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=8c9210b5-c593-a402-3a6b-2f25ad8b1319 msg="Drain complete for node 8c9210b5-c593-a402-3a6b-2f25ad8b1319"
2022-05-27T14:09:44.867Z [INFO]  internal_plugin.gce-mig: successfully deleted GCE MIG instances: action=scale_in instance_group=nmdclient-prodclientdispatchv1

Autoscaler logs for the ineligible node 711e0706:

2022-05-27T14:01:50.854Z [INFO]  internal_plugin.gce-mig: received node drain message: node_id=711e0706-0b9b-12b1-32d9-7cd0d65a96d0 msg="All allocations on node "711e0706-0b9b-12b1-32d9-7cd0d65a96d0" have stopped"
2022-05-27T14:04:50.040Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 504 (stream timeout)"
2022-05-27T14:06:50.206Z [WARN]  policy_eval.broker: eval delivery limit reached: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd token=093b0646-45cb-1280-e7a8-f6f071abb74e count=1 limit=1
2022-05-27T14:06:50.846Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c eval_token=093b0646-45cb-1280-e7a8-f6f071abb74e id=a24cd604-fc3c-d7e3-af84-b22d9c24ff73 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster error="failed to scale target: failed to perform scaling action: failed to perform pre-scale Nomad scale in tasks: context done while monitoring node drain: received error while draining node: Error monitoring node: Unexpected response code: 504 (stream timeout)"
2022-05-27T14:06:50.846Z [WARN]  policy_eval.worker: failed to NACK policy evaluation: eval_id=efd4e974-7cfe-dbbc-8fb6-a7b6cab8f28c eval_token=093b0646-45cb-1280-e7a8-f6f071abb74e id=a24cd604-fc3c-d7e3-af84-b22d9c24ff73 policy_id=17bc361d-38ba-ca49-6a30-44a04d02f0dd queue=cluster error="evaluation ID not found"
2022-05-27T14:06:51.843Z [WARN]  internal_plugin.gce-mig: node pool status readiness check failed: error="node 711e0706-0b9b-12b1-32d9-7cd0d65a96d0 is draining"

I hope this is sufficient to investigate how it gets into this state.

Thanks, VH

lgfa29 commented 2 years ago

Thanks for the new info! Unfortunately I haven't been able to figure out what could be wrong yet 🤔

That "failed to NACK policy evaluation" error is quite suspicious though.