bernardoVale opened this issue 10 months ago
Hi @bernardoVale 👋
What kind of policies are these? Are they application scaling defined in jobs?
If that's the case, it seems to be coming from Nomad itself, more specifically from here: https://github.com/hashicorp/nomad/blob/dd15bdff9c35be3fe2d0c470feea016967acea9d/helper/pool/pool.go#L512
Have you noticed any errors in the Nomad log when this happens?
> What kind of policies are these? Are they application scaling defined in jobs?

Correct, using the Nomad target plugin.

> Have you noticed any errors in the Nomad log when this happens?

It happened again today, and no, there were no error logs on the Nomad servers or clients.
This recurred for us today, so I dug into the nomad-autoscaler logs to see what was happening at the time the errors started. It looked like:
[...]
2024-07-01T03:09:20.575Z [INFO] external_plugin.nomad-meta-apm-v2: query response: query="Meta contains \"run_foo-agent\" and SchedulingEligibility == \"eligible\" and Attributes contains \"driver.docker\"" total=10 ineligible=0 job=foo pool=all timestamp=2024-07-01T03:09:20.575Z
2024-07-01T03:09:20.581Z [INFO] external_plugin.nomad-meta-apm-v2: query response: query="Meta contains \"run_bar-gateway\" and SchedulingEligibility == \"eligible\" and Attributes contains \"driver.docker\"" total=1 ineligible=0 job=bar-gateway pool=default timestamp=2024-07-01T03:09:20.581Z
2024-07-01T03:09:20.590Z [INFO] external_plugin.nomad-meta-apm-v2: query response: ineligible=0 job=baz pool=default query="Meta contains \"node_type\" and Meta[\"node_type\"] == \"storage\" and SchedulingEligibility == \"eligible\" and Attributes contains \"driver.docker\"" total=3 timestamp=2024-07-01T03:09:20.590Z
2024-07-01T03:10:16.280Z [ERROR] policy_manager.policy_handler: rpc error: code = Unavailable desc = error reading from server: EOF: policy_id=a53e053c-0c60-1296-3e35-8e6ca0aa1a6f
2024-07-01T03:10:16.280Z [ERROR] external_plugin: plugin process exited: plugin=/opt/autoscaler-plugins/kentik-nomad-target id=17 error="exit status 2"
2024-07-01T03:11:17.874Z [ERROR] policy_manager.policy_handler: rpc error: code = Canceled desc = context canceled: policy_id=854225b7-34cd-e54c-7f0f-c06fd8be8903
2024-07-01T03:11:17.874Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=854225b7-34cd-e54c-7f0f-c06fd8be8903 error="rpc error: code = Canceled desc = context canceled"
[...]
So it begins to happen after the external plugin (nomad target) exits with an error. I went looking for the code that restarts the plugin on error, and couldn't find it. I did find this: https://github.com/hashicorp/nomad-autoscaler/blob/main/plugins/manager/manager.go#L202
// TODO(jrasell) if we do not find the instance, we should probably try and
// dispense the plugin. We should also check the plugin instance has not
// exited.
So is it the case that nomad-autoscaler simply can't recover if an external plugin dies? The "rpc error:" text is generic enough that I wonder whether the message comes from the gRPC communication between the plugin client and server, rather than between the plugin server and the Nomad API.
OK, I just tried this out by manually running kill <pid of kentik-nomad-target process>
The logs closely matched the above:
[...]
2024-07-02T12:43:13.258Z [INFO] external_plugin.nomad-meta-apm-v2: query response: job=foo pool=default query="Meta contains \"run_foo\" and SchedulingEligibility == \"eligible\" and Attributes contains \"driver.docker\"" total=2 ineligible=0 timestamp=2024-07-02T12:43:13.258Z
2024-07-02T12:44:22.364Z [ERROR] external_plugin: plugin process exited: plugin=/opt/autoscaler-plugins/kentik-nomad-target id=40 error="signal: terminated"
2024-07-02T12:45:10.344Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=ff2ef4dc-b2b7-297e-ed89-e91cfd65cace error="rpc error: code = Canceled desc = context canceled"
[...]
So in terms of possible solutions, I guess one of:
?
I've got a test case and first pass at a fix in https://github.com/hashicorp/nomad-autoscaler/pull/931
I couldn't find a way to reproduce this error, but sometimes the autoscaler gets stuck and returns "context canceled" for all policy evaluations. Each eval loop, it prints one "context canceled" error per policy. Once it gets into this state, only a restart fixes it.

Here's the SIGABRT dump: