openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Timing Issue at Poseidon Restart leads to ignored Runners #598

Closed mpass99 closed 1 month ago

mpass99 commented 2 months ago

Today we became aware of another incident in which the idle runner count did not match the prewarming pool size. Via our Poseidon Dashboard, we can trace this deviation back to the deployment on the 19th (#465).

Evaluation: In Poseidon's logs, we can follow the events:

Discussion: There is a gap of 2 seconds between the runners being requested and Poseidon being able to acknowledge new runners via the Event Stream. We see that runners usually start in less than one second. Therefore, we assume that the 7 runners were started before Poseidon was ready to notice them.

Validation: In the Nomad UI, we can see 7 runners created on the 19th. All others were created today.

Preliminary Fix Suggestion: Recover the runners after starting to listen to the event stream.

Extra Question: Why did the Prewarming Pool Alert not catch this issue?
We have configured the Alert Threshold to 50%, and most of the time we had 50% or more of the Prewarming Pool available (8/15).
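The arithmetic behind this can be checked with a short sketch. It assumes the alert fires only when the idle share drops below the threshold. The exact alert rule is not shown in this issue, so `alertFires` is a hypothetical helper, not the real alert definition.

```go
package main

import "fmt"

// alertFires models an assumed rule: trigger when the share of idle
// runners in the prewarming pool falls below the threshold.
func alertFires(idle, poolSize int, threshold float64) bool {
	return float64(idle)/float64(poolSize) < threshold
}

func main() {
	// 8 of 15 runners known to Poseidon, threshold 50%.
	fmt.Printf("%.1f%% idle, alert fires: %v\n", 100*8.0/15.0, alertFires(8, 15, 0.5))
	// prints: 53.3% idle, alert fires: false
}
```

Under this assumption, losing 7 of 15 runners leaves the pool just above the threshold, while losing one more (7/15, about 46.7%) would have triggered the alert.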