hashicorp / nomad-autoscaler

Nomad Autoscaler brings autoscaling to your Nomad workloads.

Failed to get policy (EOF) #519

Closed: evandam closed this issue 1 year ago

evandam commented 3 years ago

Hi folks,

Seeing some more errors, and Nomad jobs that should definitely be triggering scaling are not being scaled.

https://gist.github.com/evandam/4956199b01cd4303873b69cf7d3779c1

My first thought is that the Nomad agent is having issues, but it seems okay and I'm able to hit the same endpoints shown in the logs directly on the host 🤷‍♂️

As a side note, is there a decent way to debug things? There's a policy_id in the logs, but I'm not sure how that maps back to the actual policy or Nomad job in question. It seems like it would be very helpful to be able to trace these back.

We're running Nomad 1.1.2 for what it's worth.

Any thoughts?

lgfa29 commented 3 years ago

Hmm... it's the second time this has been reported, so there may be something wrong with our blocking query handling.

Do you happen to have anything sitting between Nomad and the Autoscaler? Like a load balancer or reverse proxy?

There's policy_id in logs but I'm not sure how that maps back to the actual policy or Nomad job in question

Yeah, it's not ideal but you can query Nomad for the scaling policies, either using the nomad scaling policy list command or the GET /v1/scaling/policies endpoint, to get the policy ID.
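
For example, something along these lines (the jq filter and placeholder ID are illustrative; adjust to your setup):

# List scaling policies; the Target column shows which job/group each ID belongs to.
nomad scaling policy list

# Same data over HTTP; each entry carries the policy ID and its Target.
curl -s "$NOMAD_ADDR/v1/scaling/policies" | jq '.[] | {ID, Target}'

# With the ID from the autoscaler logs, show the full policy and the job it scales.
nomad scaling policy info <policy-id>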

evandam commented 3 years ago

I'm running the autoscaler as a Nomad job itself, talking to the local Nomad agent. Here's the simplified HCL: https://gist.github.com/evandam/40430410b9bb18cca74aad3714e5f568

hntrmrrs commented 2 years ago

We have experienced a similar issue where a failure to list scaling policies ends up with the autoscaler effectively hanging, periodically emitting logs like these (every ~3 minutes with our configuration):

2022-03-02T13:30:11.354Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1
2022-03-02T13:33:11.355Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1
2022-03-02T13:36:11.355Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1

The underlying cause of the EOF (which emits an error here: https://github.com/hashicorp/nomad-autoscaler/blob/main/policy/nomad/source.go#L125) was a bit more complicated to track down.

We've configured the autoscaler in a similar manner to @evandam above, in that it communicates with the Nomad agent on the node where it runs as a job. Our cluster is also configured to use TLS everywhere. When a Nomad server agent gets a SIGHUP for certificate renewal, it returns errors to client agents that are doing RPC. The client agent in turn returns a 500 to the Autoscaler API client (returned here: https://github.com/hashicorp/nomad-autoscaler/blob/main/policy/nomad/source.go#L109), which sends it into the aforementioned bad state. We've mitigated this by patching the autoscaler to call os.Exit() when that particular error is encountered (letting Nomad re-schedule the job), but obviously that's less than ideal.

I'm trying to reliably reproduce this in a dev environment, but it's a bit tricky to get all the timings right so I mainly just have prod logs to go off now.
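
For anyone else trying to reproduce it, a rough sequence that should exercise the same code path, based on the description above (assumes a TLS-enabled cluster and an autoscaler-style blocking query against the local agent; unit and variable names are illustrative):

# Start a long blocking query like the one the autoscaler issues
# (the endpoint and wait parameter match the error logs in this thread).
curl -sk "$NOMAD_ADDR/v1/scaling/policies?index=1&wait=300000ms" &

# While it is pending, trigger a certificate reload on a server agent.
sudo systemctl kill -s HUP nomad   # or: kill -HUP <nomad-server-pid>

# If the race hits, the blocking query comes back with a 500 / EOF instead of
# a result, which is what pushes the policy monitor into the stuck state.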

lgfa29 commented 2 years ago

Thanks for this extra info @hntrmrrs, super helpful.

I haven't been able to reproduce this yet, but it sounds like maybe a race condition when Nomad servers close client RPC connections?

If it happens again, could you restart the Autoscaler using a SIGABRT and send us the goroutine dump?
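
In case it helps, one way to grab that when the autoscaler runs as a Nomad job (allocation ID and task name are placeholders):

# Send SIGABRT to the autoscaler task through Nomad...
nomad alloc signal -s SIGABRT <autoscaler-alloc-id> autoscaler

# ...or directly on the host running the allocation:
kill -ABRT "$(pgrep -f 'nomad-autoscaler agent')"

# The goroutine dump ends up on the task's stderr:
nomad alloc logs -stderr <autoscaler-alloc-id> autoscaler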

Lord-Y commented 2 years ago

Hello guys,

I'm having the same issue with autoscaler version 0.3.6. I created a deployment with a scaling block and everything was working. Then I deleted it and saw these errors:

{"@level":"error","@message":"timeout: failed to read policy in 
time","@module":"policy_manager.policy_handler","@timestamp":"2022-06-03T08:49:27.557602Z","policy_id":"e83ddfb5-981b-1060-12ef-ca874dc17791"}

The scaling policy id e83ddfb5-981b-1060-12ef-ca874dc17791 does not exist at all.

Along with the previous error, I was also seeing this one:

{"@level":"error","@message":"encountered an error monitoring policy IDs","@module":"policy_manager","@timestamp":"2022-06-03T08:35:38.316593Z","error":"failed to call the Nomad list policies API: Unexpected response code: 504 (\u003chtml\u003e\r\n\u003chead\u003e\u003ctitle\u003e504 Gateway Time-out\u003c/title\u003e\u003c/head\u003e\r\n\u003cbody\u003e\r\n\u003ccenter\u003e\u003ch1\u003e504 Gateway Time-out\u003c/h1\u003e\u003c/center\u003e\r\n\u003chr\u003e\u003ccenter\u003enginx/1.18.0\u003c/center\u003e\r\n\u003c/body\u003e\r\n\u003c/html\u003e\r\n)"}

So instead of errors like that, we should have a warning- or even info-level log saying that there are no scaling policies. After I put the deployment back, all these errors disappeared.

lgfa29 commented 2 years ago

Hi @Lord-Y 👋

Could you give a little more detail on the steps you took to reproduce this? Did you run a job with a scaling block and then stop it?

And are you accessing the Nomad API via a proxy? From the second error message, there seems to be an nginx instance in front of it:

504 Gateway Time-out

Thanks!

Lord-Y commented 2 years ago

@lgfa29 Yes, I'm running Nomad behind an nginx proxy. My config is:

server {
  listen 80 default_server;
  server_name nomad;

  location / {
    proxy_pass https://127.0.0.1:4646;
    proxy_redirect     default;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Secured-Transport no;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Origin "${scheme}://${proxy_host}";
    proxy_ssl_certificate     /etc/nginx/.ssl/server.pem;
    proxy_ssl_certificate_key /etc/nginx/.ssl/server-key.pem;
    proxy_read_timeout    310s;
    proxy_buffering       off;
  }

  location /health {
    proxy_pass                https://127.0.0.1:4646/v1/status/peers;
    proxy_ssl_certificate     /etc/nginx/.ssl/server.pem;
    proxy_ssl_certificate_key /etc/nginx/.ssl/server-key.pem;
  }
}

My autoscaler config:

job "autoscaler" {
  region      = "[[.region]]"
  datacenters = ["[[.datacenter]]"]
  type        = "service"

  group "autoscaler" {
    count = 1

    network {
      port "http" {}
    }

    task "autoscaler" {
      vault {
        policies = ["XXXX-[[.platform]]-nomad-autoscaler"]
      }

      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.3.7"
        command = "nomad-autoscaler"

        labels = {
          elasticsearch_logging_index = "nomad-autoscaler"
        }

        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-http-bind-port",
          "${NOMAD_PORT_http}",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/",
          "-log-level",
          "DEBUG",
          "-enable-debug",
          "-telemetry-prometheus-metrics",
          "-log-json",
        ]

        ports = ["http"]
      }

      template {
        data = <<EOF
nomad {
  address = "http://{{env "attr.unique.network.ip-address" }}"
{{ with secret "XXXX" }}
  token = "{{ .Data.data.token }}"
{{ end }}
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "https://XXXX-[[.platform]]-[[.datacenter]].XXXX"
  }
}

target "gce-mig" {
  driver = "gce-mig"
}

strategy "target-value" {
  driver = "target-value"
}
EOF

        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
scaling "cluster_policy" {
  enabled = true
  min     = 3
  max     = 6

  policy {

    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"}*100/(nomad_client_unallocated_cpu{node_class=\"none\",region=\"[[.region]]\"}+nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"}))/count(nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"})"

      strategy "target-value" {
        target = 80
      }
    }

    check "mem_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"}*100/(nomad_client_unallocated_memory{node_class=\"none\",region=\"[[.region]]\"}+nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"}))/count(nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"})"

      strategy "target-value" {
        target = 80
      }
    }

    target "gce-mig" {
      project             = "XXXX-[[.platform]]"
      region              = "[[.region]]"
      mig_name            = "XXXX-nomad-client-[[.datacenter]]-XXXX-[[.platform]]"
      node_class          = "none"
      node_drain_deadline = "5m"
      node_purge          = "true"
      node_drain_ignore_system_jobs = "true"
    }
  }
}
EOF

        destination = "${NOMAD_TASK_DIR}/policies/none.hcl"
      }

      resources {
        cpu    = 50
        memory = 128
      }

      service {
        name = "autoscaler"
        port = "http"

        check {
          type     = "http"
          path     = "/v1/health"
          interval = "5s"
          timeout  = "2s"
        }
      }
    }
  }
}

My dummy app:

job "webapp" {
  region      = "[[.region]]"
  datacenters = ["[[.datacenter]]"]
  type        = "service"

  group "demo" {
    count = 1

    network {
      port "webapp_http" {}
      port "toxiproxy_webapp" {}
    }

    scaling {
      enabled = true
      min     = 1
      max     = 20

      policy {
        cooldown            = "1m"
        evaluation_interval = "30s"

        check "avg_sessions" {
          source = "prometheus"
          query  = "sum(traefik_service_open_connections{service=~\".*webapp@consulcatalog\",region=\"[[.region]]\"})/scalar(nomad_nomad_job_summary_running{task_group=\"demo\",region=\"[[.region]]\"})"

          strategy "target-value" {
            target = 10
          }
        }
      }
    }

    task "webapp" {
      driver = "docker"

      config {
        image = "hashicorp/demo-webapp-lb-guide"
        ports = ["webapp_http"]
      }

      env {
        PORT    = "${NOMAD_PORT_webapp_http}"
        NODE_IP = "${NOMAD_IP_webapp_http}"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      service {
        name = "webapp"
        port = "webapp_http"

        check {
          type     = "http"
          path     = "/"
          interval = "3s"
          timeout  = "1s"
        }
      }
    }

    task "toxiproxy" {
      driver = "docker"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image      = "shopify/toxiproxy:2.1.4"
        entrypoint = ["/entrypoint.sh"]
        ports      = ["toxiproxy_webapp"]

        volumes = [
          "local/entrypoint.sh:/entrypoint.sh",
        ]
      }

      template {
        data = <<EOH
#!/bin/sh

set -ex

/go/bin/toxiproxy -host 0.0.0.0  &

while ! wget --spider -q http://localhost:8474/version; do
  echo "toxiproxy not ready yet"
  sleep 0.2
done

/go/bin/toxiproxy-cli create webapp -l 0.0.0.0:${NOMAD_PORT_toxiproxy_webapp} -u ${NOMAD_ADDR_webapp_http}
/go/bin/toxiproxy-cli toxic add -n latency -t latency -a latency=1000 -a jitter=500 webapp
tail -f /dev/null
        EOH

        destination = "local/entrypoint.sh"
        perms       = "755"
      }

      resources {
        cpu    = 100
        memory = 32
      }

      service {
        name = "toxiproxy-webapp"
        port = "toxiproxy_webapp"

        tags = [
          "traefik.enable=true",
          "traefik.http.routers.webapp.entrypoints=http",
          "traefik.http.routers.webapp.rule=Host(`XXXX`)",
          #"traefik.http.routers.webapp.rule=Path(`/`)",
        ]
      }
    }
  }
}

My ACL policy for scaling:

nomad acl policy info scaling
Name        = scaling
Description = scaling
Rules       = namespace "*" {
  policy = "write"
  capabilities = ["list-scaling-policies","read-scaling-policy","read-job-scaling","read-job","list-jobs","scale-job"]
}
agent {
  policy = "read"
}
node {
  policy = "write"
}

I don't use this dummy app anymore, which means I deleted it, so the scaling policy does not exist anymore:

nomad scaling policy list
No policies found

I just checked the autoscaler logs right now, and here they are:

{"@level":"error","@message":"encountered an error monitoring policy IDs","@module":"policy_manager","@timestamp":"2022-07-19T07:57:31.052593Z","error":"failed to call the Nomad list policies API: Unexpected response code: 504 (\u003chtml\u003e\r\n\u003chead\u003e\u003ctitle\u003e504 Gateway Time-out\u003c/title\u003e\u003c/head\u003e\r\n\u003cbody\u003e\r\n\u003ccenter\u003e\u003ch1\u003e504 Gateway Time-out\u003c/h1\u003e\u003c/center\u003e\r\n\u003chr\u003e\u003ccenter\u003enginx/1.18.0\u003c/center\u003e\r\n\u003c/body\u003e\r\n\u003c/html\u003e\r\n)"}

So @lgfa29, I've just given you all the steps to reproduce the issue :-). Let me know if you need more details.
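
In case it helps narrow down where the 504 comes from, here is a rough check (a sketch, not something I've validated exhaustively): replay the same kind of long blocking query the autoscaler issues, once through the nginx proxy and once directly against the Nomad agent, and see which side times out. The wait value mirrors what shows up in the autoscaler's error logs elsewhere in this thread:

# Through the proxy (what the autoscaler sees):
time curl -sS -o /dev/null -w '%{http_code}\n' \
  "http://127.0.0.1/v1/scaling/policies?index=1&wait=300000ms"

# Directly against the Nomad agent, bypassing nginx, for comparison:
time curl -sSk -o /dev/null -w '%{http_code}\n' \
  "https://127.0.0.1:4646/v1/scaling/policies?index=1&wait=300000ms"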

ddreier commented 2 years ago

We just found 6 different instances of nomad-autoscaler v0.3.6 in 6 different Nomad clusters all in this same "failed to read policy in time" state. Like @hntrmrrs mentioned, all 6 instances seem to start with receiving a 500 from the Nomad API:

2022-06-07T20:29:21.146Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)"
2022-06-07T20:29:31.148Z [INFO]  policy_manager: starting policy source: source=nomad
2022-06-07T20:29:31.148Z [INFO]  policy_manager: starting policy source: source=file
2022-06-07T20:29:31.153Z [INFO]  file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:29:31.154Z [INFO]  file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233
2022-06-07T20:32:31.153Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:32:31.154Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233

I'm trying to confirm with some teammates, but I think this might have happened when we did an upgrade on the Nomad Servers.

If it happens again, could you restart the Autoscaler using a SIGABRT and send us the goroutine dump?

Here you go!
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
SIGABRT: abort
PC=0x46aac1 m=0 sigcode=0
goroutine 0 [idle]:
runtime.futex()
    /usr/local/go/src/runtime/sys_linux_amd64.s:519 +0x21
runtime.futexsleep(0xffffffffffff8000, 0x2561be0, 0xc00004c800)
    /usr/local/go/src/runtime/os_linux.go:44 +0x36
runtime.notesleep(0x2561d30)
    /usr/local/go/src/runtime/lock_futex.go:160 +0x87
runtime.mPark()
    /usr/local/go/src/runtime/proc.go:1441 +0x2a
runtime.stoplockedm()
    /usr/local/go/src/runtime/proc.go:2602 +0x65
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:3299 +0x3d
runtime.park_m(0xc000520680)
    /usr/local/go/src/runtime/proc.go:3516 +0x14d
runtime.mcall()
    /usr/local/go/src/runtime/asm_amd64.s:307 +0x43
goroutine 1 [chan receive, 1262 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).handleSignals(0xc0003cdb20)
    /home/circleci/project/project/agent/agent.go:270 +0xe5
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run(0xc0003cdb20)
    /home/circleci/project/project/agent/agent.go:100 +0x446
github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run(0xc000219a40, {0xc00003a0a0, 0x203000, 0x0})
    /home/circleci/project/project/command/agent.go:333 +0x8e5
github.com/mitchellh/cli.(*CLI).Run(0xc0003a5400)
    /go/pkg/mod/github.com/mitchellh/cli@v1.1.2/cli.go:262 +0x5f8
main.main()
    /home/circleci/project/project/main.go:26 +0x24d
goroutine 7 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000f4980)
    /go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:276 +0xb9
created by go.opencensus.io/stats/view.init.0
    /go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:34 +0x92
goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7f21944c2e58, 0x72)
    /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0000f5800, 0x416f46, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000f5800)
    /usr/local/go/src/internal/poll/fd_unix.go:402 +0x22c
net.(*netFD).accept(0xc0000f5800)
    /usr/local/go/src/net/fd_unix.go:173 +0x35
net.(*TCPListener).accept(0xc00039d170)
    /usr/local/go/src/net/tcpsock_posix.go:140 +0x28
net.(*TCPListener).Accept(0xc00039d170)
    /usr/local/go/src/net/tcpsock.go:262 +0x3d
net/http.(*Server).Serve(0xc0001e1340, {0x1a5bd18, 0xc00039d170})
    /usr/local/go/src/net/http/server.go:3002 +0x394
github.com/hashicorp/nomad-autoscaler/agent/http.(*Server).Start(0xc0002df360)
    /home/circleci/project/project/agent/http/server.go:124 +0xd2
created by github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run
    /home/circleci/project/project/command/agent.go:330 +0x896
goroutine 34 [chan receive]:
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).garbageCollectionLoop(0xc000469740)
    /home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:213 +0xc9
created by github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).SetConfig
    /home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:94 +0xef
goroutine 20 [syscall, 1263 minutes]:
os/signal.signal_recv()
    /usr/local/go/src/runtime/sigqueue.go:169 +0x98
os/signal.loop()
    /usr/local/go/src/os/signal/signal_unix.go:24 +0x19
created by os/signal.Notify.func1.1
    /usr/local/go/src/os/signal/signal.go:151 +0x2c
goroutine 21 [select, 100962 minutes]:
github.com/armon/go-metrics.(*InmemSignal).run(0xc00053e240)
    /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:64 +0x6c
created by github.com/armon/go-metrics.NewInmemSignal
    /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:38 +0x174
goroutine 23 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:193 +0x12e
github.com/armon/go-metrics.(*Metrics).collectStats(0xc0004f2000)
    /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/metrics.go:230 +0x25
created by github.com/armon/go-metrics.New
    /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/start.go:84 +0x17b
goroutine 25 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a1e0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 26 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a230, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 27 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a280, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 28 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a2d0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 29 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a320, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 30 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a370, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 31 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a3c0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 32 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a410, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 49 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a460, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 50 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a4b0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 51 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a500, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 52 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a550, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 53 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5a0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 54 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5f0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 55 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a640, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 56 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a690, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 57 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a6e0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 58 [select, 59054 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a730, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 59 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a780, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 60 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:246 +0x1e5
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7})
    /home/circleci/project/project/policyeval/broker.go:176 +0x19a
github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a7d0, {0x1a6fb30, 0xc000400c00})
    /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers
    /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 61 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler(0xc0003cdb20, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0)
    /home/circleci/project/project/agent/agent.go:106 +0xb2
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run
    /home/circleci/project/project/agent/agent.go:97 +0x439
goroutine 5811358 [select, 5 minutes]:
net/http.(*persistConn).roundTrip(0xc00053a000, 0xc0006f22c0)
    /usr/local/go/src/net/http/transport.go:2614 +0x97d
net/http.(*Transport).roundTrip(0xc0001e48c0, 0xc000191100)
    /usr/local/go/src/net/http/transport.go:594 +0x7d1
net/http.(*Transport).RoundTrip(0x30, 0x1a3c1e0)
    /usr/local/go/src/net/http/roundtrip.go:18 +0x19
net/http.send(0xc000191100, {0x1a3c1e0, 0xc0001e48c0}, {0x17ff560, 0x40ce01, 0x0})
    /usr/local/go/src/net/http/client.go:252 +0x5d8
net/http.(*Client).send(0xc000465e00, 0xc000191100, {0x203000, 0x203000, 0x0})
    /usr/local/go/src/net/http/client.go:176 +0x9b
net/http.(*Client).do(0xc000465e00, 0xc000191100)
    /usr/local/go/src/net/http/client.go:725 +0x908
net/http.(*Client).Do(...)
    /usr/local/go/src/net/http/client.go:593
github.com/hashicorp/nomad/api.(*Client).doRequest(0xc0003b21c0, 0xc0000e06c0)
    /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:722 +0x65
github.com/hashicorp/nomad/api.(*Client).query(0xc0004c37b0, {0x184fced, 0xc0006920c0}, {0x1478d40, 0xc00043adc8}, 0x464a00)
    /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:841 +0xb4
github.com/hashicorp/nomad/api.(*Scaling).ListPolicies(0xc0004c37b0, 0xc00043e760)
    /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/scaling.go:20 +0x65
github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs.func1()
    /home/circleci/project/project/policy/nomad/source.go:109 +0xca
created by github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs
    /home/circleci/project/project/policy/nomad/source.go:102 +0x1e6
goroutine 5206364 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021de00, {0x1a6fb30, 0xc000400c00}, 0xc0002a8fd0)
    /home/circleci/project/project/policy/handler.go:118 +0x411
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc0000406f0, 0x24})
    /home/circleci/project/project/policy/manager.go:116 +0x46
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:115 +0xed3
goroutine 5206401 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run(0xc00036a140, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0)
    /home/circleci/project/project/policy/manager.go:72 +0x475
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:154 +0x847
goroutine 5206403 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorIDs(0xc0004b6100, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0})
    /home/circleci/project/project/policy/file/source.go:84 +0x131
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:67 +0x1f3
goroutine 5206404 [select, 5 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs(0xc00037a240, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0})
    /home/circleci/project/project/policy/nomad/source.go:113 +0x26d
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:67 +0x1f3
goroutine 5206410 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0004005c0}, {{0xc0000406f0, 0x24}, 0xc00007e720, 0xc00007e840, 0xc00007e6c0})
    /home/circleci/project/project/policy/file/source.go:149 +0x7bc
created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run
    /home/circleci/project/project/policy/handler.go:115 +0x2e5
goroutine 5206402 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).periodicMetricsReporter(0xc00036a140, {0x1a6fb30, 0xc000400340}, 0xc000366cc0)
    /home/circleci/project/project/policy/manager.go:224 +0xdf
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:61 +0x196
goroutine 5811386 [IO wait, 5 minutes]:
internal/poll.runtime_pollWait(0x7f21944c2c88, 0x72)
    /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000466f00, 0xc000131800, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000466f00, {0xc000131800, 0xb09, 0xb09})
    /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000466f00, {0xc000131800, 0xc00008d758, 0xc00008d758})
    /usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0003720e0, {0xc000131800, 0xc000315800, 0xc00008d7f8})
    /usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc00043ae70, {0xc000131800, 0x0, 0x409b8d})
    /usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc0001c4978, {0x1a39800, 0xc00043ae70})
    /usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc0001c4700, {0x1a3c0c0, 0xc0003720e0}, 0x183be26)
    /usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc0001c4700, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
    /usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc0001c4700, {0xc000356000, 0x1000, 0x10100c0004ad800})
    /usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
net/http.(*persistConn).Read(0xc00053a000, {0xc000356000, 0x40505d, 0x60})
    /usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc0002b0a80)
    /usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc0002b0a80, 0x1)
    /usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00053a000)
    /usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1747 +0x1e05
goroutine 5811243 [IO wait]:
internal/poll.runtime_pollWait(0x7f21944c28e8, 0x72)
    /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000466e80, 0xc000122000, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000466e80, {0xc000122000, 0x1000, 0x1000})
    /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000466e80, {0xc000122000, 0x4cee66, 0x2560c80})
    /usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00000ed40, {0xc000122000, 0x0, 0xc000732758})
    /usr/local/go/src/net/net.go:183 +0x45
net/http.(*connReader).Read(0xc000732750, {0xc000122000, 0x1000, 0x1000})
    /usr/local/go/src/net/http/server.go:780 +0x16d
bufio.(*Reader).fill(0xc000692060)
    /usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc000692060, 0x4)
    /usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*conn).serve(0xc000422000, {0x1a6fbd8, 0xc00042ede0})
    /usr/local/go/src/net/http/server.go:1955 +0xc36
created by net/http.(*Server).Serve
    /usr/local/go/src/net/http/server.go:3034 +0x4e8
goroutine 5811387 [select, 5 minutes]:
net/http.(*persistConn).writeLoop(0xc00053a000)
    /usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1748 +0x1e65
goroutine 5206365 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021df40, {0x1a6fb30, 0xc000400c00}, 0x0)
    /home/circleci/project/project/policy/handler.go:118 +0x411
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc000040720, 0x24})
    /home/circleci/project/project/policy/manager.go:116 +0x46
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
    /home/circleci/project/project/policy/manager.go:115 +0xed3
goroutine 5206366 [select, 1262 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0006f35c0}, {{0xc000040720, 0x24}, 0xc00007e900, 0xc00007ea80, 0xc00007e8a0})
    /home/circleci/project/project/policy/file/source.go:149 +0x7bc
created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run
    /home/circleci/project/project/policy/handler.go:115 +0x2e5
rax    0xca
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x2561d30
rsi    0x80
rbp    0x7fffd8ad9100
rsp    0x7fffd8ad90b8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x43e700
r13    0x0
r14    0x2561400
r15    0x0
rip    0x46aac1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

Maybe it would make sense to have the health check start to fail when there's any kind of failure to read policies?

leosunmo commented 2 years ago

We are also affected by this, quite frequently, at least once a week.

The one that just occurred also seems to have started after a 500. Something goes wrong when trying to list policies and then it never recovers.

2022-09-17T12:22:46.020Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "https://nomad.cluster.net/v1/scaling/policies?index=1&namespace=default&region=global&wait=300000ms": EOF"

Somehow the policies both become nil and it can't seem to fetch them again?

As you can see from the log dump below, we refresh the config quite frequently because of expiring tokens, and that doesn't fix it either, only a restart does.

Longer log segment:

2022-09-17T12:20:07.192Z [INFO] agent: caught signal: signal=hangup
2022-09-17T12:20:07.192Z [INFO] agent: reloading Autoscaler configuration
2022-09-17T12:20:07.193Z [INFO] file_policy_source: file policy source ID monitor received reload signal
2022-09-17T12:20:07.195Z [INFO] file_policy_source: file policy source monitor received reload signal: file=local/policies/scaling_policy.hcl name=overflow_cluster_policy policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:20:07.195Z [INFO] file_policy_source: file policy source monitor received reload signal: file=local/policies/scaling_policy.hcl name=stable_cluster_policy policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:22:46.020Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "https://nomad.cluster.net/v1/scaling/policies?index=1&namespace=default&region=global&wait=300000ms": EOF"
2022-09-17T12:22:56.020Z [INFO] policy_manager: starting policy source: source=file
2022-09-17T12:22:56.020Z [INFO] policy_manager: starting policy source: source=nomad
2022-09-17T12:22:56.024Z [INFO] file_policy_source: starting file policy monitor: file=local/policies/scaling_policy.hcl name=stable_cluster_policy policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:22:56.026Z [INFO] file_policy_source: starting file policy monitor: file=local/policies/scaling_policy.hcl name=overflow_cluster_policy policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:25:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:25:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:28:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:28:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:31:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:31:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:34:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:34:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:37:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:37:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:40:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:40:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:43:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:43:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:46:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:46:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:49:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:49:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:52:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:52:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:53:40.215Z [INFO] agent: caught signal: signal=hangup
2022-09-17T12:53:40.215Z [INFO] agent: reloading Autoscaler configuration
2022-09-17T12:53:40.216Z [INFO] file_policy_source: file policy source ID monitor received reload signal
2022-09-17T12:53:40.219Z [INFO] file_policy_source: file policy source monitor received reload signal: file=local/policies/scaling_policy.hcl name=overflow_cluster_policy policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:53:40.219Z [INFO] file_policy_source: file policy source monitor received reload signal: file=local/policies/scaling_policy.hcl name=stable_cluster_policy policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:55:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:55:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T12:58:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T12:58:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:01:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:01:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:04:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:04:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:07:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:07:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:10:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:10:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:13:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:13:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:16:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:16:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:19:56.022Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:19:56.022Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:22:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:22:56.023Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601
2022-09-17T13:25:56.022Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=238876ec-7a5f-d994-8852-81d6bb4aea77
2022-09-17T13:25:56.022Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ea060e2d-adff-f8a3-6934-3e3a461de601

I've only looked at the code for a few minutes, so I won't be of much help, but I'd really like to see a fix or mitigation for this soon! It's a bit of a showstopper!

lgfa29 commented 2 years ago

Thanks for the updates everyone! This has been a very tricky bug to pinpoint because I suspect it may be related to problems both in the Autoscaler and in the Nomad API.

I know it's frustrating to not have a fix for this common occurrence, and we appreciate your patience. Keep the reports and additional information coming; they are all very helpful!

atulsingh0 commented 1 year ago

Is there a way to dump these logs to a file within the container and have some kind of check on that file for the error line? This is just a workaround to restart the pod if the log contains the error.

2022-06-07T20:32:31.153Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7

I've tried the statement below in the Dockerfile, but the file doesn't capture the logs:

RUN touch /var/log/nomad-autoscaler.log \
&& ln -sf /proc/$$/fd/1 /var/log/nomad-autoscaler.log

Any suggestions?

peter-lockhart-pub commented 1 year ago

Hi, just tagging along that we are also seeing this. The repro for us was that our Nomad server got redeployed, which caused the autoscaler to log "failed to read policy in time" for each policy evaluation until the autoscaler deployment was restarted.

Is there a way we can detect these policy failures so we can proactively react to them before our system hits capacity issues due to a lack of scaling? We are scraping Prometheus metrics and I can see metrics for, e.g., how long policies take to run, but I don't see anything that shows the resolution of a policy evaluation. Are there any metrics we could use to detect that a policy fails to run to completion successfully, so we could build alerting rules?

atulsingh0 commented 1 year ago

Hi @peter-lockhart-pub and others, this is the hack I've put in place to work around the problem; it kills the pod if the error pops up in the log.

I added these two scripts in a ConfigMap:

configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: nomad-autoscaler
  name: nomad-autoscaler
data:
  entrypoint.sh: |
    $@ 2>&1 | tee -a /logs/nomad-autoscaler.log
  liveness-on-log.sh: |
    if tail /logs/nomad-autoscaler.log | grep -q 'policy_manager.policy_handler: timeout: failed to read policy in time'
    then
      echo "Killing the 'nomad-autoscaler' process to restart the POD"
      kill $(ps -ef | grep "nomad-autoscaler agent" | grep -v "/scripts/" | head -1 | awk '{ print $1}')
    else
      echo "liveness-on-log check is passed."
    fi

Next, I mounted this ConfigMap as a /scripts volume and created an emptyDir volume at /logs. My nomad-autoscaler deployment YAML looks somewhat like the below:

deployment.yaml

.....
.....
spec:
  containers:
  - args:
    - nomad-autoscaler
    - agent
    - -config
    - /etc/nomad-autoscaler/conf/shared
    - -config
    - /etc/nomad-autoscaler/conf/secret
    command:
    - sh
    - /scripts/entrypoint.sh
    image: hashicorp/nomad-autoscaler:0.3.7
    name: nomad-autoscaler
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - sh
        - /scripts/liveness-on-log.sh
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    volumeMounts:
    - mountPath: /logs
      name: vol
    - mountPath: /scripts
      name: scripts
.....
.....
  volumes:
  - emptyDir:
      sizeLimit: 100Mi
    name: vol
  - configMap:
      defaultMode: 420
      items:
      - key: liveness-on-log.sh
        path: liveness-on-log.sh
      - key: entrypoint.sh
        path: entrypoint.sh
      name: nomad-autoscaler
    name: scripts
....
....

How it works: the entrypoint script tees the agent's output into /logs/nomad-autoscaler.log, and the liveness probe greps that file for the "failed to read policy in time" error; when it matches, the probe kills the nomad-autoscaler process, which makes Kubernetes restart the pod.

lgfa29 commented 1 year ago

Hi @atulsingh0 👋

Your case may be a different error; do you see the EOF message at the end of the error log?

@peter-lockhart-pub that's a good question. Since I haven't been able to reproduce this yet I can't tell for sure, but maybe nomad-autoscaler.policy.source.error_count would increase when the EOF error happens?
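
If you want to eyeball that counter, something like this should work against the agent's own API (assuming the -telemetry-prometheus-metrics flag from the job specs above, and that your version serves /v1/metrics alongside /v1/health; address and port are placeholders):

# Scrape the autoscaler's metrics endpoint and look at the policy source error counter.
curl -s "http://<autoscaler-addr>:<http-port>/v1/metrics?format=prometheus" \
  | grep -i 'policy_source_error_count'

An alerting rule on that counter increasing over time would then be straightforward to add to an existing Prometheus setup.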

DanieleGravina commented 1 year ago

For us, this problem appears when the nomad server node restarts (for instance, manually restarting it with systemctl restart nomad). I was able to reproduce it multiple times and the outcome was always the following:

2023-07-10T13:36:54.319Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)"
2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=file
2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=nomad
2023-07-10T13:37:04.321Z [INFO] file_policy_source: starting file policy monitor: file=local/policies/batch.hcl name=batch policy_id=xxx
2023-07-10T13:40:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:43:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:46:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:49:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx

A possible quick workaround, inspired by @atulsingh0:

Disclaimer: I came up with this fast-and-furious solution to restart the autoscaler job, but it's far from perfect and it only works if the autoscaler logs are in the file autoscaler.stderr.0. Use it at your own risk.

service {
  check {
    type     = "script"
    command  = "/bin/sh"
    args     = ["-c", "! tail -n 1 alloc/logs/autoscaler.stderr.0 | grep -q 'policy_manager.policy_handler: timeout: failed to read policy in time'"]
    interval = "60s"
    timeout  = "10s"

    check_restart {
      limit           = 3
      grace           = "90s"
      ignore_warnings = false
    }
  }
}

gulducat commented 1 year ago

Heya folks, apologies for the long delay here, thanks for all your input!

PR #733 is up for the "failed to read policy in time" errors (with a nod to @Seros), specifically those occurring in file policies, which I think is the bulk of the concern in this issue. File policies getting "stuck" like this can be caused by various expressions of Nomad API errors -- at the moment, any error when listing (Nomad) policies that includes "EOF" or "connection refused" -- which trigger a soft restart of the internal policy manager; the policy manager then fails to re-load file policies properly.

I think some other issues were brought up in this issue too, so if this does not address your situation, please let us know! We may suggest opening a new issue to keep things tidy, but we don't want to lose track of important errors.