openHPI / poseidon

Scalable task execution orchestrator for CodeOcean

New connections to draining Runners #659

Open mpass99 opened 4 weeks ago

mpass99 commented 4 weeks ago

Our current drain_on_shutdown strategy for stopping Nomad agents is:

Executions that don't have enough time to finish before the agent stops result in a user-visible error.

We might need to exclude the affected runners from new executions as soon as the respective Nomad agent is about to shut down.

See #651


Unfortunately, we currently don't have any metric to count how often this issue occurs.

MrSerth commented 3 weeks ago

ToDo: Let's identify which error / log information / ... we get when the above issue occurs.

mpass99 commented 3 days ago

We've reproduced this scenario locally: nomadEventLog-ExecuteDraining.txt. The log shows that the issue surfaces as POSEIDON-3W (#590) with the sub-error "the allocation was rescheduled". This error has not occurred for at least 90 days.

If a fix becomes necessary in the future, we could listen to Nomad's Node events to receive drain updates, fetch all allocations of the affected node, and block new executions for these allocations/runners. Further, we should ensure that the drain deadline matches the maximum of all execution timeouts allowed by CodeOcean.