Because our test server doesn't stop synchronously, it looks like these goroutines are being leaked from previous tests. I'm going to rename this issue and mark it for testing improvements.
While investigating a test failure in https://github.com/hashicorp/nomad/pull/10590 I encountered what might be a goroutine leak when we step down from leadership in `(*BlockedEvals).prune` and `(*BlockedEvals).watchCapacity`. If this pans out, this is an older area of the code base and I suspect it will impact all supported versions of Nomad.

At the time of the test panic, the test was running the `TestRPC_Limits_OK` test, which itself spins up parallel subtests with one server each, and `TestJobEndpoint_Register_Connect_ValidatesWithoutSidecarTask`, which spins up one server. So we should have at most 2 servers running and 2 enabled `BlockedEvals` runners. But we have 22!

The `(*BlockedEvals).SetEnabled` method that enables/disables those two goroutines for the leader looks correct to me at first glance, but it's designed a bit differently from how we did `(*PeriodicDispatch).SetEnabled` or `(*EvalBroker).SetEnabled`. It's at least worth looking into why it was done differently.

The trouble with this is that we don't know if this is being leaked from previous tests that have passed and not getting cleaned up correctly, or whether this is located within a single server.