hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.77k stars 1.94k forks source link

nomad.TestServer should shut down synchronously #10597

Open tgross opened 3 years ago

tgross commented 3 years ago

While investigating a test failure in https://github.com/hashicorp/nomad/pull/10590 I encountered what might be a goroutine leak when we step down from leadership in (*BlockedEvals).prune and (*BlockedEvals).watchCapacity.

If this pans out, this is an older area of the code base and I suspect it will impact all supported versions of Nomad.

At the time of the test panic, the test was running the TestRPC_Limits_OK test, which itself spins up parallel subtests with one server each, and TestJobEndpoint_Register_Connect_ValidatesWithoutSidecarTask, which spins up one server. So we should have at most 2 servers running and 2 enabled BlockedEvals runners.

But we have 22!

$ cat stacks.log | gostack2json | \
  jq '[.[] | select(.Stack[0].Func == "github.com/hashicorp/nomad/nomad.(*BlockedEvals).prune")] | length'
22
$ cat stacks.log| gostack2json | \
  jq '[.[] | select(.Stack[0].Func == "github.com/hashicorp/nomad/nomad.(*BlockedEvals).watchCapacity")] | length'
22

The (*BlockedEvals).SetEnabled method that enables/disables those two goroutines for the leader looks correct to me at first glance, but it's designed a bit differently from how we did (*PeriodicDispatch).SetEnabled or (*EvalBroker).SetEnabled. It's at least worth looking into why it was done differently.

The trouble with this is that we don't know if this is being leaked from previous tests that have passed and not getting cleaned up correctly, or whether this is located within a single server.

tgross commented 3 years ago

Because our test server doesn't stop synchronously, it looks like these goroutines are being leaked from previous tests. Going to rename this issue and mark it for testing improvements.