openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Failed loading job. Skipping... #627

Closed: sentry-io[bot] closed this issue 1 week ago

sentry-io[bot] commented 3 months ago

Sentry Issue: POSEIDON-5Z

Failed loading job. Skipping...

This error first occurred on production on July 2 at 04:42 CEST. Is it related to our unattended upgrades?

mpass99 commented 1 week ago

This error happens during environment recovery. The sub-error of the production event, "error loading runner portMappings: error querying allocation for runner 29-c8c06eee-381c-11ef-bc48-fa163e1390db: no allocation found", shows that Poseidon was recovering the port mappings of the job and could not find an allocation for it. This occurs when the allocation suddenly stops during the recovery process.

Combined with the unattended upgrades, the following scenario seems plausible: The unattended upgrade first restarted Poseidon or a Nomad server. This triggered the Poseidon recovery process, which lists all Nomad jobs. Shortly after, the unattended upgrade restarted a Nomad agent, which stopped the allocation (and hopefully led to a rescheduling/migration). At that moment, however, there was no allocation to read the port mappings from, so the error was raised.

MrSerth commented 1 week ago

The Sentry issue is mostly a warning that can occur during runner recovery. It indicates that a runner listed during recovery was not found in a subsequent request for more runner details. Hence, it is neither added to Poseidon's list of available runners nor actively removed from Nomad. We assume that the corresponding Nomad allocation is either in the process of restarting (more likely) or was terminated during recovery (less likely). If the allocation is restarted, it will be recovered the next time Poseidon starts, so the error is "self-healing".
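The skip behaviour described above can be illustrated with a minimal sketch. It is not Poseidon's actual recovery code: the `Job` type, `recoverRunners`, and the `lookup` callback are hypothetical stand-ins, and logging goes to standard output rather than Sentry.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoAllocation is a hypothetical error for a job without a live allocation.
var errNoAllocation = errors.New("no allocation found")

// Job is a hypothetical, simplified view of a Nomad job listed during recovery.
type Job struct{ ID string }

// recoverRunners walks over all listed jobs and asks lookup for more details
// (here: port numbers). A job whose lookup fails is only logged and skipped:
// it is neither added to the recovered runners nor deleted anywhere.
func recoverRunners(jobs []Job, lookup func(id string) ([]int, error)) (map[string][]int, []string) {
	recovered := make(map[string][]int)
	var skipped []string
	for _, job := range jobs {
		ports, err := lookup(job.ID)
		if err != nil {
			fmt.Printf("Failed loading job. Skipping... (%s: %v)\n", job.ID, err)
			skipped = append(skipped, job.ID)
			continue
		}
		recovered[job.ID] = ports
	}
	return recovered, skipped
}

func main() {
	// Assumption for the demo: runner-b lost its allocation mid-recovery.
	lookup := func(id string) ([]int, error) {
		if id == "runner-b" {
			return nil, errNoAllocation
		}
		return []int{8080}, nil
	}
	recovered, skipped := recoverRunners(
		[]Job{{ID: "runner-a"}, {ID: "runner-b"}}, lookup)
	fmt.Println(len(recovered), skipped)
}
```

Since the skipped job is left untouched, a later recovery run would pick it up again once its allocation is back, which matches the "self-healing" behaviour described in the comment.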

For now, there is nothing to do unless the issue escalates and occurs much more often. We are closing this issue in the meantime.