nomad.TestServer should shut down synchronously

While investigating a test failure in https://github.com/hashicorp/nomad/pull/10590 I encountered what might be a goroutine leak in (*BlockedEvals).prune and (*BlockedEvals).watchCapacity when we step down from leadership.

This is an older area of the code base, so if this pans out I suspect it will affect all supported versions of Nomad.

At the time of the test panic, the test run was executing TestRPC_Limits_OK, which itself spins up parallel subtests with one server each, and TestJobEndpoint_Register_Connect_ValidatesWithoutSidecarTask, which spins up one server. So we should have at most 2 servers running and 2 enabled BlockedEvals runners.

But we have 22!

$ cat stacks.log | gostack2json | \
  jq '[.[] | select(.Stack[0].Func == "github.com/hashicorp/nomad/nomad.(*BlockedEvals).prune")] | length'
22
$ cat stacks.log | gostack2json | \
  jq '[.[] | select(.Stack[0].Func == "github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity")] | length'
22

The (*BlockedEvals).SetEnabled method that enables/disables those two goroutines for the leader looks correct to me at first glance, but it's designed a bit differently from how we did (*PeriodicDispatch).SetEnabled or (*EvalBroker).SetEnabled. It's at least worth looking into why it was done differently.

The trouble is that we can't tell whether these goroutines are being leaked by earlier tests that passed but didn't clean up their servers correctly, or whether the leak happens within a single server.

nomad.TestServer should shut down synchronously #10597