Status: Closed (evandam closed this issue 1 year ago)
Hmm... this is the second time this has been reported, so there may be something wrong with our blocking query handling.
Do you happen to have anything sitting between Nomad and the Autoscaler? Like a load balancer or reverse proxy?
There's a `policy_id` in the logs, but I'm not sure how that maps back to the actual policy or Nomad job in question.
Yeah, it's not ideal, but you can query Nomad for the scaling policies, either using the `nomad scaling policy list` command or the `GET /v1/scaling/policies` endpoint, to get the policy ID.
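To make the correlation concrete, here is a small sketch (Python, not part of the autoscaler) that builds a lookup table from a `GET /v1/scaling/policies` response. The payload shape assumed below (an `ID` plus a `Target` map with `Namespace`/`Job`/`Group` keys) follows the Nomad scaling API; the sample values are made up.

```python
import json

# Hypothetical response body from GET /v1/scaling/policies.
sample = json.loads("""
[
  {"ID": "c8449594-6c20-ca82-4bb5-869772a8aea1",
   "Enabled": true,
   "Target": {"Namespace": "default", "Job": "webapp", "Group": "demo"}}
]
""")

def policy_index(policies):
    """Map policy ID -> (namespace, job, group) for quick log correlation."""
    return {
        p["ID"]: (
            p.get("Target", {}).get("Namespace", "default"),
            p.get("Target", {}).get("Job", ""),
            p.get("Target", {}).get("Group", ""),
        )
        for p in policies
    }

# Look up the policy_id seen in the autoscaler's error logs.
print(policy_index(sample)["c8449594-6c20-ca82-4bb5-869772a8aea1"])
```

With an index like this, a `policy_id` from the error logs can be matched to the job/group that owns the policy in one lookup.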
I'm running the autoscaler as a Nomad job itself, talking to the local Nomad agent. Here's the simplified HCL: https://gist.github.com/evandam/40430410b9bb18cca74aad3714e5f568
We have experienced a similar issue where a failure to list scaling policies ends up with the autoscaler effectively hanging, periodically emitting logs like these (every ~3 minutes with our configuration):
2022-03-02T13:30:11.354Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1
2022-03-02T13:33:11.355Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1
2022-03-02T13:36:11.355Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=c8449594-6c20-ca82-4bb5-869772a8aea1
The underlying cause of the EOF (which emits an error here: https://github.com/hashicorp/nomad-autoscaler/blob/main/policy/nomad/source.go#L125) was a bit more complicated to track down.
We've configured the autoscaler similarly to @evandam above, in that it communicates with the Nomad agent where it's running as a job. Our cluster is also configured to use TLS everywhere. When a Nomad server agent gets a SIGHUP for certificate renewal, it returns errors to client agents that are mid-RPC. The client agent in turn returns a 500 to the Autoscaler API client (returned here: https://github.com/hashicorp/nomad-autoscaler/blob/main/policy/nomad/source.go#L109), which sends it into the aforementioned bad state. We've mitigated this by patching the autoscaler to call os.Exit() when that particular error is encountered (letting Nomad reschedule the job), but obviously that's less than ideal.
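The actual patch is in Go inside the autoscaler, but the shape of the crash-and-restart mitigation described above can be sketched like this (the marker string is an assumption based on the error text quoted in this thread):

```python
import sys

# Assumed marker text for the unrecoverable transport error; in the real
# patch this would match the specific error returned by the Nomad API client.
FATAL_MARKERS = ("rpc error: EOF",)

def check_list_policies_error(err_msg):
    """Exit the process if the error looks like the stuck-state trigger."""
    if any(marker in err_msg for marker in FATAL_MARKERS):
        # Non-zero exit lets the scheduler (Nomad here) restart the job with
        # fresh connections instead of letting the agent loop forever.
        sys.exit(1)
    # Otherwise fall through: log and retry on the next poll interval.
```

The trade-off, as noted, is that crashing on a transient error is a blunt instrument; the proper fix is for the policy monitor itself to recover.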
I'm trying to reliably reproduce this in a dev environment, but it's a bit tricky to get all the timings right, so for now I mainly just have prod logs to go off of.
Thanks for this extra info @hntrmrrs, super helpful.
I haven't been able to reproduce this yet, but it sounds like it could be a race condition when Nomad servers close client RPC connections?
If it happens again, could you restart the Autoscaler using a SIGABRT and send us the goroutine dump?
Hello guys,
I'm having the same issue with autoscaler version 0.3.6.
I created a deployment with a scaling block and everything was working. Then I deleted it and saw these errors:
{"@level":"error","@message":"timeout: failed to read policy in time","@module":"policy_manager.policy_handler","@timestamp":"2022-06-03T08:49:27.557602Z","policy_id":"e83ddfb5-981b-1060-12ef-ca874dc17791"}
The scaling policy id e83ddfb5-981b-1060-12ef-ca874dc17791 does not exist at all.
Alongside the previous error, I was also getting this one:
{"@level":"error","@message":"encountered an error monitoring policy IDs","@module":"policy_manager","@timestamp":"2022-06-03T08:35:38.316593Z","error":"failed to call the Nomad list policies API: Unexpected response code: 504 (\u003chtml\u003e\r\n\u003chead\u003e\u003ctitle\u003e504 Gateway Time-out\u003c/title\u003e\u003c/head\u003e\r\n\u003cbody\u003e\r\n\u003ccenter\u003e\u003ch1\u003e504 Gateway Time-out\u003c/h1\u003e\u003c/center\u003e\r\n\u003chr\u003e\u003ccenter\u003enginx/1.18.0\u003c/center\u003e\r\n\u003c/body\u003e\r\n\u003c/html\u003e\r\n)"}
So instead of errors like that, we should get a warning (or even info) level log saying that there are no scaling policies.
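The suggested behaviour could look like the sketch below (illustrative names and levels, not the autoscaler's actual API): an empty policy list is a normal state and gets a low-severity log, while ERROR is reserved for real API failures.

```python
def classify_policy_result(policies, api_error=None):
    """Return (log_level, message) for one list-policies poll result."""
    if api_error is not None:
        # A failed API call is genuinely an error worth surfacing.
        return ("error", f"failed to call the Nomad list policies API: {api_error}")
    if not policies:
        # No policies is a valid, expected state after a job is deleted.
        return ("info", "no scaling policies found; nothing to monitor")
    return ("debug", f"monitoring {len(policies)} scaling policies")
```

This keeps the error channel meaningful: operators alerting on ERROR logs would no longer page on an empty (but healthy) cluster.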
After I put back the deployment, all these errors disappeared.
Hi @Lord-Y 👋
Could you give a little more detail on the steps you took to reproduce this? Did you run a job with a scaling block and then stop it?
And are you accessing the Nomad API via a proxy? From the second error message (504 Gateway Time-out) there seems to be an nginx in front of it.
Thanks!
@lgfa29 Yes, I'm running Nomad behind an nginx proxy. My config is:
server {
    listen 80 default_server;
    server_name nomad;

    location / {
        proxy_pass https://127.0.0.1:4646;
        proxy_redirect default;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Secured-Transport no;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Origin "${scheme}://${proxy_host}";
        proxy_ssl_certificate /etc/nginx/.ssl/server.pem;
        proxy_ssl_certificate_key /etc/nginx/.ssl/server-key.pem;
        proxy_read_timeout 310s;
        proxy_buffering off;
    }

    location /health {
        proxy_pass https://127.0.0.1:4646/v1/status/peers;
        proxy_ssl_certificate /etc/nginx/.ssl/server.pem;
        proxy_ssl_certificate_key /etc/nginx/.ssl/server-key.pem;
    }
}
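One thing worth checking with a proxy in the path (an observation based on Nomad's documented blocking-query behaviour, not a confirmed diagnosis for this report): the autoscaler's policy monitor uses blocking queries, which Nomad holds open for the `wait` duration (default 5 minutes) plus a small server-side jitter of up to 1/16 of `wait`, i.e. up to roughly 320s. A `proxy_read_timeout` of 310s can therefore expire before Nomad responds, and the proxy answers 504 like the one in the log above. A fragment with more headroom:

```nginx
# Assumption: give blocking queries comfortable headroom beyond Nomad's
# default 5m wait plus up to wait/16 jitter (~320s worst case).
proxy_read_timeout 350s;
proxy_send_timeout 350s;
```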
My autoscaler config:
job "autoscaler" {
  region      = "[[.region]]"
  datacenters = ["[[.datacenter]]"]
  type        = "service"

  group "autoscaler" {
    count = 1

    network {
      port "http" {}
    }

    task "autoscaler" {
      vault {
        policies = ["XXXX-[[.platform]]-nomad-autoscaler"]
      }

      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.3.7"
        command = "nomad-autoscaler"

        labels = {
          elasticsearch_logging_index = "nomad-autoscaler"
        }

        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-http-bind-port",
          "${NOMAD_PORT_http}",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/",
          "-log-level",
          "DEBUG",
          "-enable-debug",
          "-telemetry-prometheus-metrics",
          "-log-json",
        ]

        ports = ["http"]
      }

      template {
        data = <<EOF
nomad {
  address = "http://{{env "attr.unique.network.ip-address" }}"
  {{ with secret "XXXX" }}
  token = "{{ .Data.data.token }}"
  {{ end }}
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "https://XXXX-[[.platform]]-[[.datacenter]].XXXX"
  }
}

target "gce-mig" {
  driver = "gce-mig"
}

strategy "target-value" {
  driver = "target-value"
}
EOF

        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }

      template {
        data = <<EOF
scaling "cluster_policy" {
  enabled = true
  min     = 3
  max     = 6

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"}*100/(nomad_client_unallocated_cpu{node_class=\"none\",region=\"[[.region]]\"}+nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"}))/count(nomad_client_allocated_cpu{node_class=\"none\",region=\"[[.region]]\"})"
      strategy "target-value" {
        target = 80
      }
    }

    check "mem_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"}*100/(nomad_client_unallocated_memory{node_class=\"none\",region=\"[[.region]]\"}+nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"}))/count(nomad_client_allocated_memory{node_class=\"none\",region=\"[[.region]]\"})"
      strategy "target-value" {
        target = 80
      }
    }

    target "gce-mig" {
      project                       = "XXXX-[[.platform]]"
      region                        = "[[.region]]"
      mig_name                      = "XXXX-nomad-client-[[.datacenter]]-XXXX-[[.platform]]"
      node_class                    = "none"
      node_drain_deadline           = "5m"
      node_purge                    = "true"
      node_drain_ignore_system_jobs = "true"
    }
  }
}
EOF

        destination = "${NOMAD_TASK_DIR}/policies/none.hcl"
      }

      resources {
        cpu    = 50
        memory = 128
      }

      service {
        name = "autoscaler"
        port = "http"

        check {
          type     = "http"
          path     = "/v1/health"
          interval = "5s"
          timeout  = "2s"
        }
      }
    }
  }
}
My dummy app:
job "webapp" {
  region      = "[[.region]]"
  datacenters = ["[[.datacenter]]"]
  type        = "service"

  group "demo" {
    count = 1

    network {
      port "webapp_http" {}
      port "toxiproxy_webapp" {}
    }

    scaling {
      enabled = true
      min     = 1
      max     = 20

      policy {
        cooldown            = "1m"
        evaluation_interval = "30s"

        check "avg_sessions" {
          source = "prometheus"
          query  = "sum(traefik_service_open_connections{service=~\".*webapp@consulcatalog\",region=\"[[.region]]\"})/scalar(nomad_nomad_job_summary_running{task_group=\"demo\",region=\"[[.region]]\"})"
          strategy "target-value" {
            target = 10
          }
        }
      }
    }

    task "webapp" {
      driver = "docker"

      config {
        image = "hashicorp/demo-webapp-lb-guide"
        ports = ["webapp_http"]
      }

      env {
        PORT    = "${NOMAD_PORT_webapp_http}"
        NODE_IP = "${NOMAD_IP_webapp_http}"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      service {
        name = "webapp"
        port = "webapp_http"

        check {
          type     = "http"
          path     = "/"
          interval = "3s"
          timeout  = "1s"
        }
      }
    }

    task "toxiproxy" {
      driver = "docker"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        image      = "shopify/toxiproxy:2.1.4"
        entrypoint = ["/entrypoint.sh"]
        ports      = ["toxiproxy_webapp"]
        volumes = [
          "local/entrypoint.sh:/entrypoint.sh",
        ]
      }

      template {
        data = <<EOH
#!/bin/sh
set -ex
/go/bin/toxiproxy -host 0.0.0.0 &
while ! wget --spider -q http://localhost:8474/version; do
  echo "toxiproxy not ready yet"
  sleep 0.2
done
/go/bin/toxiproxy-cli create webapp -l 0.0.0.0:${NOMAD_PORT_toxiproxy_webapp} -u ${NOMAD_ADDR_webapp_http}
/go/bin/toxiproxy-cli toxic add -n latency -t latency -a latency=1000 -a jitter=500 webapp
tail -f /dev/null
EOH

        destination = "local/entrypoint.sh"
        perms       = "755"
      }

      resources {
        cpu    = 100
        memory = 32
      }

      service {
        name = "toxiproxy-webapp"
        port = "toxiproxy_webapp"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.webapp.entrypoints=http",
          "traefik.http.routers.webapp.rule=Host(`XXXX`)",
          #"traefik.http.routers.webapp.rule=Path(`/`)",
        ]
      }
    }
  }
}
My scaling policy:
nomad acl policy info scaling
Name        = scaling
Description = scaling
Rules       = namespace "*" {
  policy       = "write"
  capabilities = ["list-scaling-policies", "read-scaling-policy", "read-job-scaling", "read-job", "list-jobs", "scale-job"]
}

agent {
  policy = "read"
}

node {
  policy = "write"
}
I don't use this dummy app anymore, which means I deleted it, so the scaling policy does not exist anymore:
nomad scaling policy list
No policies found
I just checked the autoscaler logs, and here they are:
{"@level":"error","@message":"encountered an error monitoring policy IDs","@module":"policy_manager","@timestamp":"2022-07-19T07:57:31.052593Z","error":"failed to call the Nomad list policies API: Unexpected response code: 504 (\u003chtml\u003e\r\n\u003chead\u003e\u003ctitle\u003e504 Gateway Time-out\u003c/title\u003e\u003c/head\u003e\r\n\u003cbody\u003e\r\n\u003ccenter\u003e\u003ch1\u003e504 Gateway Time-out\u003c/h1\u003e\u003c/center\u003e\r\n\u003chr\u003e\u003ccenter\u003enginx/1.18.0\u003c/center\u003e\r\n\u003c/body\u003e\r\n\u003c/html\u003e\r\n)"}
So @lgfa29, I just gave you all the steps to reproduce the issue :-). Let me know if you need more details.
We just found 6 different instances of nomad-autoscaler v0.3.6 in 6 different Nomad clusters all in this same "failed to read policy in time" state. Like @hntrmrrs mentioned, all 6 instances seem to start with receiving a 500 from the Nomad API:
2022-06-07T20:29:21.146Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)"
2022-06-07T20:29:31.148Z [INFO] policy_manager: starting policy source: source=nomad
2022-06-07T20:29:31.148Z [INFO] policy_manager: starting policy source: source=file
2022-06-07T20:29:31.153Z [INFO] file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:29:31.154Z [INFO] file_policy_source: starting file policy monitor: file=/local/policies/aws-asg.hcl name=... policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233
2022-06-07T20:32:31.153Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
2022-06-07T20:32:31.154Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=2830b7ec-4dfe-a6f5-1a8d-d012a7216233
I'm trying to confirm with some teammates, but I think this might have happened when we did an upgrade on the Nomad Servers.
> If it happens again, could you restart the Autoscaler using a SIGABRT and send us the goroutine dump?
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:44:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:47:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:50:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:53:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T18:56:52.664Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T18:59:52.665Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=ffbaf7d4-41c7-5703-fcee-a641f253a19f
2022-07-19T19:02:52.666Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=46c49638-8dfd-b3e2-9340-9a637ed8d8da
SIGABRT: abort
PC=0x46aac1 m=0 sigcode=0
goroutine 0 [idle]:
runtime.futex() /usr/local/go/src/runtime/sys_linux_amd64.s:519 +0x21 runtime.futexsleep(0xffffffffffff8000, 0x2561be0, 0xc00004c800) /usr/local/go/src/runtime/os_linux.go:44 +0x36 runtime.notesleep(0x2561d30) /usr/local/go/src/runtime/lock_futex.go:160 +0x87 runtime.mPark() /usr/local/go/src/runtime/proc.go:1441 +0x2a runtime.stoplockedm() /usr/local/go/src/runtime/proc.go:2602 +0x65 runtime.schedule() /usr/local/go/src/runtime/proc.go:3299 +0x3d runtime.park_m(0xc000520680) /usr/local/go/src/runtime/proc.go:3516 +0x14d runtime.mcall() /usr/local/go/src/runtime/asm_amd64.s:307 +0x43
goroutine 1 [chan receive, 1262 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).handleSignals(0xc0003cdb20) /home/circleci/project/project/agent/agent.go:270 +0xe5 github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run(0xc0003cdb20) /home/circleci/project/project/agent/agent.go:100 +0x446 github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run(0xc000219a40, {0xc00003a0a0, 0x203000, 0x0}) /home/circleci/project/project/command/agent.go:333 +0x8e5 github.com/mitchellh/cli.(*CLI).Run(0xc0003a5400) /go/pkg/mod/github.com/mitchellh/cli@v1.1.2/cli.go:262 +0x5f8 main.main() /home/circleci/project/project/main.go:26 +0x24d
goroutine 7 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000f4980) /go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:276 +0xb9 created by go.opencensus.io/stats/view.init.0 /go/pkg/mod/go.opencensus.io@v0.22.5/stats/view/worker.go:34 +0x92
goroutine 10 [IO wait]:
internal/poll.runtime_pollWait(0x7f21944c2e58, 0x72) /usr/local/go/src/runtime/netpoll.go:234 +0x89 internal/poll.(*pollDesc).wait(0xc0000f5800, 0x416f46, 0x0) /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 internal/poll.(*pollDesc).waitRead(...) /usr/local/go/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc0000f5800) /usr/local/go/src/internal/poll/fd_unix.go:402 +0x22c net.(*netFD).accept(0xc0000f5800) /usr/local/go/src/net/fd_unix.go:173 +0x35 net.(*TCPListener).accept(0xc00039d170) /usr/local/go/src/net/tcpsock_posix.go:140 +0x28 net.(*TCPListener).Accept(0xc00039d170) /usr/local/go/src/net/tcpsock.go:262 +0x3d net/http.(*Server).Serve(0xc0001e1340, {0x1a5bd18, 0xc00039d170}) /usr/local/go/src/net/http/server.go:3002 +0x394 github.com/hashicorp/nomad-autoscaler/agent/http.(*Server).Start(0xc0002df360) /home/circleci/project/project/agent/http/server.go:124 +0xd2 created by github.com/hashicorp/nomad-autoscaler/command.(*AgentCommand).Run /home/circleci/project/project/command/agent.go:330 +0x896
goroutine 34 [chan receive]:
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).garbageCollectionLoop(0xc000469740) /home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:213 +0xc9 created by github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).SetConfig /home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:94 +0xef
goroutine 20 [syscall, 1263 minutes]:
os/signal.signal_recv() /usr/local/go/src/runtime/sigqueue.go:169 +0x98 os/signal.loop() /usr/local/go/src/os/signal/signal_unix.go:24 +0x19 created by os/signal.Notify.func1.1 /usr/local/go/src/os/signal/signal.go:151 +0x2c
goroutine 21 [select, 100962 minutes]:
github.com/armon/go-metrics.(*InmemSignal).run(0xc00053e240) /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:64 +0x6c created by github.com/armon/go-metrics.NewInmemSignal /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/inmem_signal.go:38 +0x174
goroutine 23 [sleep]:
time.Sleep(0x3b9aca00) /usr/local/go/src/runtime/time.go:193 +0x12e github.com/armon/go-metrics.(*Metrics).collectStats(0xc0004f2000) /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/metrics.go:230 +0x25 created by github.com/armon/go-metrics.New /go/pkg/mod/github.com/armon/go-metrics@v0.3.7/start.go:84 +0x17b
goroutine 25 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a1e0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 26 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a230, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 27 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a280, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 28 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a2d0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 29 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a320, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 30 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a370, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 31 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a3c0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 32 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a410, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 49 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a460, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 50 [select, 100962 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x1843df3, 0xa}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a4b0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:128 +0x25f
goroutine 51 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a500, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 52 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a550, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 53 [select, 59052 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5a0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 54 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a5f0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 55 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a640, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 56 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a690, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 57 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a6e0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 58 [select, 59054 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a730, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 59 [select, 59055 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a780, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 60 [select, 59053 minutes]:
github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).waitForWork(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:246 +0x1e5 github.com/hashicorp/nomad-autoscaler/policyeval.(*Broker).Dequeue(0xc000366d20, {0x1a6fb30, 0xc000400c00}, {0x183fb5a, 0x7}) /home/circleci/project/project/policyeval/broker.go:176 +0x19a github.com/hashicorp/nomad-autoscaler/policyeval.(*BaseWorker).Run(0xc00036a7d0, {0x1a6fb30, 0xc000400c00}) /home/circleci/project/project/policyeval/base_worker.go:60 +0xb6 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).initWorkers /home/circleci/project/project/agent/agent.go:134 +0x35b
goroutine 61 [select, 59051 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler(0xc0003cdb20, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0) /home/circleci/project/project/agent/agent.go:106 +0xb2 created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run /home/circleci/project/project/agent/agent.go:97 +0x439
goroutine 5811358 [select, 5 minutes]:
net/http.(*persistConn).roundTrip(0xc00053a000, 0xc0006f22c0) /usr/local/go/src/net/http/transport.go:2614 +0x97d net/http.(*Transport).roundTrip(0xc0001e48c0, 0xc000191100) /usr/local/go/src/net/http/transport.go:594 +0x7d1 net/http.(*Transport).RoundTrip(0x30, 0x1a3c1e0) /usr/local/go/src/net/http/roundtrip.go:18 +0x19 net/http.send(0xc000191100, {0x1a3c1e0, 0xc0001e48c0}, {0x17ff560, 0x40ce01, 0x0}) /usr/local/go/src/net/http/client.go:252 +0x5d8 net/http.(*Client).send(0xc000465e00, 0xc000191100, {0x203000, 0x203000, 0x0}) /usr/local/go/src/net/http/client.go:176 +0x9b net/http.(*Client).do(0xc000465e00, 0xc000191100) /usr/local/go/src/net/http/client.go:725 +0x908 net/http.(*Client).Do(...) /usr/local/go/src/net/http/client.go:593 github.com/hashicorp/nomad/api.(*Client).doRequest(0xc0003b21c0, 0xc0000e06c0) /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:722 +0x65 github.com/hashicorp/nomad/api.(*Client).query(0xc0004c37b0, {0x184fced, 0xc0006920c0}, {0x1478d40, 0xc00043adc8}, 0x464a00) /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/api.go:841 +0xb4 github.com/hashicorp/nomad/api.(*Scaling).ListPolicies(0xc0004c37b0, 0xc00043e760) /go/pkg/mod/github.com/hashicorp/nomad/api@v0.0.0-20210503143957-4ccada7924cf/scaling.go:20 +0x65 github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs.func1() /home/circleci/project/project/policy/nomad/source.go:109 +0xca created by github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs /home/circleci/project/project/policy/nomad/source.go:102 +0x1e6
goroutine 5206364 [select]:
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021de00, {0x1a6fb30, 0xc000400c00}, 0xc0002a8fd0) /home/circleci/project/project/policy/handler.go:118 +0x411 github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc0000406f0, 0x24}) /home/circleci/project/project/policy/manager.go:116 +0x46 created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:115 +0xed3
goroutine 5206401 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run(0xc00036a140, {0x1a6fb30, 0xc000400c00}, 0xc000366cc0) /home/circleci/project/project/policy/manager.go:72 +0x475 created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:154 +0x847
goroutine 5206403 [select, 1263 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorIDs(0xc0004b6100, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0})
/home/circleci/project/project/policy/file/source.go:84 +0x131 created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:67 +0x1f3 goroutine 5206404 [select, 5 minutes]: github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorIDs(0xc00037a240, {0x1a6fb30, 0xc000400340}, {0xc00035e360, 0xc00035e1e0}) /home/circleci/project/project/policy/nomad/source.go:113 +0x26d created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:67 +0x1f3 goroutine 5206410 [select, 1263 minutes]: github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0004005c0}, {{0xc0000406f0, 0x24}, 0xc00007e720, 0xc00007e840, 0xc00007e6c0}) /home/circleci/project/project/policy/file/source.go:149 +0x7bc created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run /home/circleci/project/project/policy/handler.go:115 +0x2e5 goroutine 5206402 [select]: github.com/hashicorp/nomad-autoscaler/policy.(*Manager).periodicMetricsReporter(0xc00036a140, {0x1a6fb30, 0xc000400340}, 0xc000366cc0) /home/circleci/project/project/policy/manager.go:224 +0xdf created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:61 +0x196 goroutine 5811386 [IO wait, 5 minutes]: internal/poll.runtime_pollWait(0x7f21944c2c88, 0x72) /usr/local/go/src/runtime/netpoll.go:234 +0x89 internal/poll.(*pollDesc).wait(0xc000466f00, 0xc000131800, 0x0) /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 internal/poll.(*pollDesc).waitRead(...) 
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc000466f00, {0xc000131800, 0xb09, 0xb09}) /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a net.(*netFD).Read(0xc000466f00, {0xc000131800, 0xc00008d758, 0xc00008d758}) /usr/local/go/src/net/fd_posix.go:56 +0x29 net.(*conn).Read(0xc0003720e0, {0xc000131800, 0xc000315800, 0xc00008d7f8}) /usr/local/go/src/net/net.go:183 +0x45 crypto/tls.(*atLeastReader).Read(0xc00043ae70, {0xc000131800, 0x0, 0x409b8d}) /usr/local/go/src/crypto/tls/conn.go:777 +0x3d bytes.(*Buffer).ReadFrom(0xc0001c4978, {0x1a39800, 0xc00043ae70}) /usr/local/go/src/bytes/buffer.go:204 +0x98 crypto/tls.(*Conn).readFromUntil(0xc0001c4700, {0x1a3c0c0, 0xc0003720e0}, 0x183be26) /usr/local/go/src/crypto/tls/conn.go:799 +0xe5 crypto/tls.(*Conn).readRecordOrCCS(0xc0001c4700, 0x0) /usr/local/go/src/crypto/tls/conn.go:606 +0x112 crypto/tls.(*Conn).readRecord(...) /usr/local/go/src/crypto/tls/conn.go:574 crypto/tls.(*Conn).Read(0xc0001c4700, {0xc000356000, 0x1000, 0x10100c0004ad800}) /usr/local/go/src/crypto/tls/conn.go:1277 +0x16f net/http.(*persistConn).Read(0xc00053a000, {0xc000356000, 0x40505d, 0x60}) /usr/local/go/src/net/http/transport.go:1926 +0x4e bufio.(*Reader).fill(0xc0002b0a80) /usr/local/go/src/bufio/bufio.go:101 +0x103 bufio.(*Reader).Peek(0xc0002b0a80, 0x1) /usr/local/go/src/bufio/bufio.go:139 +0x5d net/http.(*persistConn).readLoop(0xc00053a000) /usr/local/go/src/net/http/transport.go:2087 +0x1ac created by net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go:1747 +0x1e05 goroutine 5811243 [IO wait]: internal/poll.runtime_pollWait(0x7f21944c28e8, 0x72) /usr/local/go/src/runtime/netpoll.go:234 +0x89 internal/poll.(*pollDesc).wait(0xc000466e80, 0xc000122000, 0x0) /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 internal/poll.(*pollDesc).waitRead(...) 
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc000466e80, {0xc000122000, 0x1000, 0x1000}) /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a net.(*netFD).Read(0xc000466e80, {0xc000122000, 0x4cee66, 0x2560c80}) /usr/local/go/src/net/fd_posix.go:56 +0x29 net.(*conn).Read(0xc00000ed40, {0xc000122000, 0x0, 0xc000732758}) /usr/local/go/src/net/net.go:183 +0x45 net/http.(*connReader).Read(0xc000732750, {0xc000122000, 0x1000, 0x1000}) /usr/local/go/src/net/http/server.go:780 +0x16d bufio.(*Reader).fill(0xc000692060) /usr/local/go/src/bufio/bufio.go:101 +0x103 bufio.(*Reader).Peek(0xc000692060, 0x4) /usr/local/go/src/bufio/bufio.go:139 +0x5d net/http.(*conn).serve(0xc000422000, {0x1a6fbd8, 0xc00042ede0}) /usr/local/go/src/net/http/server.go:1955 +0xc36 created by net/http.(*Server).Serve /usr/local/go/src/net/http/server.go:3034 +0x4e8 goroutine 5811387 [select, 5 minutes]: net/http.(*persistConn).writeLoop(0xc00053a000) /usr/local/go/src/net/http/transport.go:2386 +0xfb created by net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go:1748 +0x1e65 goroutine 5206365 [select]: github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc00021df40, {0x1a6fb30, 0xc000400c00}, 0x0) /home/circleci/project/project/policy/handler.go:118 +0x411 github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1({0xc000040720, 0x24}) /home/circleci/project/project/policy/manager.go:116 +0x46 created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run /home/circleci/project/project/policy/manager.go:115 +0xed3 goroutine 5206366 [select, 1262 minutes]: github.com/hashicorp/nomad-autoscaler/policy/file.(*Source).MonitorPolicy(0xc0004b6100, {0x1a6fb30, 0xc0006f35c0}, {{0xc000040720, 0x24}, 0xc00007e900, 0xc00007ea80, 0xc00007e8a0}) /home/circleci/project/project/policy/file/source.go:149 +0x7bc created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run /home/circleci/project/project/policy/handler.go:115 
+0x2e5 rax 0xca rbx 0x0 rcx 0xffffffffffffffff rdx 0x0 rdi 0x2561d30 rsi 0x80 rbp 0x7fffd8ad9100 rsp 0x7fffd8ad90b8 r8 0x0 r9 0x0 r10 0x0 r11 0x286 r12 0x43e700 r13 0x0 r14 0x2561400 r15 0x0 rip 0x46aac1 rflags 0x286 cs 0x33 fs 0x0 gs 0x0
Maybe it would make sense to have the health check start failing when there's any kind of failure to read policies?
We are also affected by this, quite frequently, at least once a week.
The one that just occurred also seems to have started after a 500. Something goes wrong when trying to list policies, and then it never recovers.
2022-09-17T12:22:46.020Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Get "https://nomad.cluster.net/v1/scaling/policies?index=1&namespace=default&region=global&wait=300000ms": EOF"
Somehow the policies both become nil and it can't seem to fetch them again?
As you can see from the log dump below, we refresh the config quite frequently because of expiring tokens, and that doesn't fix it either; only a restart does.
I've only looked at the code for a few minutes, so I won't be of much help, but I'd really like to see a fix or mitigation for this soon. It's a bit of a showstopper!
Thanks for the updates everyone! This has been a very tricky bug to pinpoint because I suspect it may be related to problems both in the Autoscaler and in the Nomad API.
I know it's frustrating to not have a fix for this common occurrence, and we appreciate the patience. Keep the reports and additional information coming, they are all very helpful!
Is there a way to dump these logs to a file within the container and have some kind of check on that file for the error line? This would just be a workaround to restart the pod if the log contains the error.
2022-06-07T20:32:31.153Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=d7c916c9-4a0b-34db-d6ac-943abc74e1e7
I've tried the statement below in the Dockerfile, but the file doesn't capture the logs:
RUN touch /var/log/nomad-autoscaler.log \
    && ln -sf /proc/$$/fd/1 /var/log/nomad-autoscaler.log
Any suggestions?
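One thing worth noting: in a Dockerfile, `$$` inside a RUN instruction expands during the image build (to the build shell's PID), so the baked-in symlink points at a PID that no longer exists when the container runs. Redirecting at runtime with `tee` avoids that. A minimal sketch of the idea, where `echo` stands in for the real `nomad-autoscaler agent ...` command and the log path is hypothetical:

```shell
#!/bin/sh
# Sketch: mirror a process's stdout/stderr into a file a probe can grep,
# while still keeping the output on stdout for `kubectl logs`/`docker logs`.
# LOG_FILE is a hypothetical path; `echo` stands in for the agent command.
LOG_FILE="${LOG_FILE:-$(mktemp)}"

run_with_log() {
  # Run the given command; tee duplicates the stream to stdout AND $LOG_FILE.
  "$@" 2>&1 | tee -a "$LOG_FILE"
}

run_with_log echo '[ERROR] policy_manager.policy_handler: timeout: failed to read policy in time'
grep -c 'failed to read policy in time' "$LOG_FILE"
```

With the real agent command in place of `echo`, a sidecar or probe can then grep `$LOG_FILE` for the error line.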
Hi, just tagging along that we are also seeing this. The repro for us was our Nomad server getting redeployed, which caused the autoscaler to log failed to read policy in time for each policy evaluation until the deployment was restarted.
Is there a way we can detect these policy failures so we can proactively react to them before our system hits capacity issues due to lack of scaling? We are scraping Prometheus metrics and I can see metrics for e.g. how long policies take to run, but I don't see anything that shows the resolution of a policy evaluation. Are there any metrics we could use to detect when a policy fails to run to completion successfully, so we could build alerting rules?
hi @peter-lockhart-pub and others, this is the hack I have put in place to resolve the problem, which kills the pod if the error pops up in the log:
Added these 2 scripts in a config map.
Next, mount this config map on a /scripts volume and create an emptyDir volume /logs.
My nomad-autoscaler deployment YAML looks somewhat like the below.
How it works:
The entrypoint.sh script sends the standard output and error logs into a file named /logs/nomad-autoscaler.log, which is created in the ephemeral (emptyDir) volume logs.
The liveness-on-log.sh script is configured as a livenessProbe, which runs every few seconds and greps the log file created via entrypoint.sh; if the line 'policy_manager.policy_handler: timeout: failed to read policy in time' exists, it kills the nomad-autoscaler process, which kills the pod itself.
Hi @atulsingh0 👋
Your case may be a different error; do you see the EOF message at the end of the error log?
@peter-lockhart-pub that's a good question. Since I haven't been able to reproduce this yet I can't tell for sure, but maybe nomad-autoscaler.policy.source.error_count would increase when the EOF error happens?
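For anyone who wants to alert on that counter: the idea is to scrape the agent's metrics endpoint twice and flag when the counter grew between samples. A sketch of that comparison logic, where the `/v1/metrics?format=prometheus` endpoint and the flattened metric name `nomad_autoscaler_policy_source_error_count` are assumptions you should verify against your autoscaler version's actual telemetry output (canned samples stand in for real scrapes here):

```shell
#!/bin/sh
# Sketch: decide whether the autoscaler's policy-source error counter grew
# between two Prometheus-format scrapes. Metric name and endpoint are
# ASSUMPTIONS; check your agent's /v1/metrics output for the real name.

# Extract the counter value from a Prometheus-format scrape on stdin.
error_count() {
  awk '/^nomad_autoscaler_policy_source_error_count/ {print $2; exit}'
}

# Return success (0) if the counter grew from $1 to $2.
counter_grew() {
  [ "$2" -gt "$1" ]
}

# In production each sample would come from something like:
#   curl -fsS "http://127.0.0.1:8080/v1/metrics?format=prometheus"
# Canned samples demonstrate the flow:
prev=$(printf 'nomad_autoscaler_policy_source_error_count 3\n' | error_count)
curr=$(printf 'nomad_autoscaler_policy_source_error_count 7\n' | error_count)

if counter_grew "$prev" "$curr"; then
  echo "policy source errors increasing: $prev -> $curr"
fi
```

A cron job or sidecar running this comparison (persisting `prev` to a state file between runs) could page you, or feed a restart, well before capacity runs out.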
For us, this problem appears when the Nomad server node restarts (for instance, manually restarting it with systemctl restart nomad).
I was able to reproduce it multiple times and the outcome was always the following:
2023-07-10T13:36:54.319Z [ERROR] policy_manager: encountered an error monitoring policy IDs: error="failed to call the Nomad list policies API: Unexpected response code: 500 (rpc error: EOF)"
2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=file
2023-07-10T13:37:04.320Z [INFO] policy_manager: starting policy source: source=nomad
2023-07-10T13:37:04.321Z [INFO] file_policy_source: starting file policy monitor: file=local/policies/batch.hcl name=batch policy_id=xxx
2023-07-10T13:40:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:43:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:46:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
2023-07-10T13:49:04.321Z [ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx
A possible quick workaround, inspired by atulsingh0:
Disclaimer: I came up with this fast-and-furious solution to restart the autoscaler job, but it's far from perfect and it only works if the autoscaler logs are in the file autoscaler.stderr.0. Use it at your own risk.
service {
  check {
    type     = "script"
    command  = "/bin/sh"
    args     = ["-c", "! tail -n 1 alloc/logs/autoscaler.stderr.0 | grep -q 'policy_manager.policy_handler: timeout: failed to read policy in time'"]
    interval = "60s"
    timeout  = "10s"

    check_restart {
      limit           = 3
      grace           = "90s"
      ignore_warnings = false
    }
  }
}
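If you want to sanity-check that grep expression before shipping it, the check's pass/fail logic can be exercised locally against a throwaway log file (same pattern as the Nomad script check above; the log path here is just a temp file):

```shell
#!/bin/sh
# Exercise the "fail when the last log line is the timeout error" logic
# used by the script check, against a throwaway log file.
LOG=$(mktemp)

healthy() {
  # Succeeds (healthy) unless the most recent log line is the timeout error.
  ! tail -n 1 "$LOG" | grep -q 'policy_manager.policy_handler: timeout: failed to read policy in time'
}

echo '[INFO] policy_manager: starting policy source: source=nomad' >> "$LOG"
healthy && echo "check 1: healthy"

echo '[ERROR] policy_manager.policy_handler: timeout: failed to read policy in time: policy_id=xxx' >> "$LOG"
healthy || echo "check 2: unhealthy, Nomad would restart the task"
```

Note the check only looks at the last line, so one healthy log line after the error resets it; with `limit = 3` the error effectively has to stay the most recent line across three consecutive intervals before a restart fires.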
Heya folks, apologies for the long delay here, thanks for all your input!
PR #733 is up for the "failed to read policy in time" errors (with a nod to @Seros), specifically those occurring in file policies, which I think is the bulk of the concern in this issue. File policies can get "stuck" like this due to several different expressions of Nomad API errors (at the moment, any error when listing Nomad policies that includes "EOF" or "connection refused"). Such an error triggers a soft restart of the internal policy manager, which then fails to re-load file policies properly.
Some other problems were brought up in this issue too, so if this does not address your situation, please let us know! We may suggest opening a new issue to keep things tidy, but we don't want to lose track of important errors.
Hi folks,
Seeing some more errors, and Nomad jobs that should definitely be triggering scaling are not being scaled.
https://gist.github.com/evandam/4956199b01cd4303873b69cf7d3779c1
My first thought is that the Nomad agent is having issues, but it seems okay, and I'm able to hit the same endpoints from the logs directly on the host 🤷‍♂️
As a side note, is there a decent way to debug things? There's policy_id in the logs but I'm not sure how that maps back to the actual policy or Nomad job in question. It seems like it would be very helpful to be able to trace these back. We're running Nomad 1.1.2 for what it's worth.
Any thoughts?