openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Handle permanently dead Nomad jobs #612

Open mpass99 opened 1 month ago

mpass99 commented 1 month ago

Related to #587

In a recent deployment, we have observed that some (but not all) runners are lost when all Nomad agents restart.

Within this issue, we should identify the Nomad event that tells Poseidon a job is lost and will be neither restarted nor rescheduled, and handle that event by requesting a new runner. [Jobs] [Allocations]

This should be fixed together with #602.