golemcloud / golem

Golem is an open source durable computing platform that makes it easy to build and deploy highly reliable distributed systems.
https://learn.golem.cloud/
Apache License 2.0

Panic during recovery is reported as divergence #1013

Open vigoo opened 1 month ago

vigoo commented 1 month ago

It is possible (most likely due to a bug introduced in Golem) that something triggers a panic in the user code during worker recovery, even though the same code previously ran successfully.

In this case the panic handler calls a set of host functions that do not match what is recorded in the oplog (for example getting the current environment, accessing stdin/stdout, writing to the output).

We cannot capture the information this panic handler would print, because we immediately detect a divergence and fail the recovery with it.
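To illustrate the failure mode, here is a minimal sketch of a replay loop that checks each live host call against the recorded oplog. It is a simplified model, not the executor's actual code: the `Replay` and `ReplayError` types and the host function names (`random_u64`, `http_request`, `get_environment`, `write_stderr`) are hypothetical and chosen only for illustration.

```rust
// Hypothetical, heavily simplified model of oplog replay.
// None of these names are taken from the Golem code base.

/// Why a replayed host call was rejected.
#[derive(Debug, PartialEq)]
enum ReplayError {
    /// The live call does not match the next recorded oplog entry.
    Divergence { index: usize, expected: String, actual: String },
    /// The live code made more host calls than the oplog recorded.
    OplogExhausted { actual: String },
}

/// Holds the recorded host calls and the current replay position.
struct Replay {
    oplog: Vec<String>,
    position: usize,
}

impl Replay {
    fn new(oplog: &[&str]) -> Self {
        Replay {
            oplog: oplog.iter().map(|s| s.to_string()).collect(),
            position: 0,
        }
    }

    /// Called for every host function the worker invokes during recovery.
    fn host_call(&mut self, name: &str) -> Result<(), ReplayError> {
        match self.oplog.get(self.position) {
            None => Err(ReplayError::OplogExhausted { actual: name.to_string() }),
            Some(expected) if expected == name => {
                self.position += 1;
                Ok(())
            }
            Some(expected) => Err(ReplayError::Divergence {
                index: self.position,
                expected: expected.clone(),
                actual: name.to_string(),
            }),
        }
    }
}

/// Simulates a worker that panics after its first host call during recovery.
fn replay_panicking_worker() -> Result<(), ReplayError> {
    // What the original, successful run recorded.
    let mut replay = Replay::new(&["random_u64", "http_request", "write_stdout"]);

    // The first call replays as recorded.
    replay.host_call("random_u64")?;

    // The user code panics here. The panic handler now issues host calls
    // that the original run never made.
    replay.host_call("get_environment")?; // rejected as a divergence
    replay.host_call("write_stderr")?; // never reached: the message is lost
    Ok(())
}
```

In this model, `replay_panicking_worker()` returns a `Divergence` at oplog index 1 (expected `http_request`, got `get_environment`), and the call that would have written the panic message is never executed. The reported error therefore describes the handler's first unexpected host call rather than the panic itself.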

Ideally we should be able to see why the worker recovery panics, in order to fix the root issue in the executor. This should be considered when implementing #980.