golemcloud / golem

Golem is an open source durable computing platform that makes it easy to build and deploy highly reliable distributed systems.
https://learn.golem.cloud/
Apache License 2.0

Tests and fixes for oplog corruption bug #962

Closed vigoo closed 2 months ago

vigoo commented 2 months ago

If a set of conditions were met:

It could happen that the newly opened oplog used the wrong view of the "last oplog index" (looking only at the primary layer), so the next written entry got the wrong identifier (1). Since this index (and many of the following ones) was already in use and stored in one of the archive layers, the result was a corrupt oplog, leading to many unexpected issues.
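A minimal sketch of the failure mode described above. This is not Golem's actual code: `LayeredOplog`, its fields, and both method names are hypothetical, and the sketch assumes, purely for illustration, that the primary layer is empty while an archive layer already holds entries.

```rust
/// Hypothetical model of a layered oplog; not Golem's real types or API.
struct LayeredOplog {
    /// Highest index stored in the primary layer, or None if it is empty.
    primary_last: Option<u64>,
    /// Highest index stored in each archive layer, or None if that layer is empty.
    archive_last: Vec<Option<u64>>,
}

impl LayeredOplog {
    /// Buggy view: consults only the primary layer.
    fn last_index_primary_only(&self) -> u64 {
        self.primary_last.unwrap_or(0)
    }

    /// Corrected view: takes the maximum over the primary and all archive layers.
    fn last_index_all_layers(&self) -> u64 {
        self.archive_last
            .iter()
            .copied()
            .chain(std::iter::once(self.primary_last))
            .flatten() // drop empty layers
            .max()
            .unwrap_or(0)
    }
}

fn main() {
    // Assumed state: every existing entry lives in an archive layer.
    let oplog = LayeredOplog {
        primary_last: None,
        archive_last: vec![Some(120), None],
    };

    // The next entry is written at "last index + 1".
    let buggy_next = oplog.last_index_primary_only() + 1;
    let fixed_next = oplog.last_index_all_layers() + 1;

    println!("primary-only next index: {buggy_next}"); // collides with an archived entry
    println!("all-layers next index: {fixed_next}");
}
```

In this sketch the primary-only view hands out index 1 again even though the archive layer already contains it, while the all-layers view continues after the highest archived index.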

This pull request:

(Also updates wasm-rpc to 1.0.3 as it contains some important stub generator fixes.)

noise64 commented 2 months ago

can getting the last oplog and open be in a race-condition?

vigoo commented 2 months ago

> can getting the last oplog and open be in a race-condition?

only in split-brain scenarios (if another executor is writing the oplog)

normally there is only a single instance of Oplog per worker and that's the only way to append to the oplog