marsupialtail / quokka

Making data lake work for time series
https://marsupialtail.github.io/quokka/
Apache License 2.0
1.1k stars 60 forks

Very subtle bug in fault tolerance for staged execution (not in release yet, fix before next release) #31

Closed marsupialtail closed 1 year ago

marsupialtail commented 1 year ago

Currently the correctness of the staged execution hinges on:

This works just fine in normal execution. After a failure, however, an intricate scenario can arise. Consider a left-deep join tree in which all intermediates and the probe input have stage 0 and the build inputs have stage -1. Consider the join node at the top, which has stage 0. In normal execution all build inputs have finished, so the global stage counter has been incremented to 0. Suppose the join node at the top has not yet executed anything.

If it now executes, the Arrow Flight server will preferentially serve up the build side first. Great.

However, if the machine it is on dies and the node is resurrected on another machine, it will ask for its inputs from the build node and from the intermediate node before it, either of which could be replayed. The global stage counter is now 0. If the probe-side replayer finishes fast and this node executes before the build-side replayer is done, we have a problem: the join would probe against a build side that is still incomplete.
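To make the hazard concrete, here is a toy sketch of what goes wrong when a probe batch is consumed before the build side has been fully replayed. It is not Quokka code: `ToyHashJoin`, `run`, and the batch layout are all hypothetical names invented for illustration, and the "schedule" simply stands in for the order in which replayed batches happen to arrive.

```python
from collections import defaultdict


class ToyHashJoin:
    """Hypothetical in-memory hash join; stands in for the top join node."""

    def __init__(self):
        self.table = defaultdict(list)

    def build(self, rows):
        # Insert build-side (key, value) rows into the hash table.
        for key, value in rows:
            self.table[key].append(value)

    def probe(self, rows):
        # Emit (key, build_value, probe_value) for every match found so far.
        return [(key, b, p) for key, p in rows for b in self.table.get(key, [])]


def run(schedule, build_batches, probe_batches):
    """Feed batches to a fresh join in the order given by `schedule`."""
    join = ToyHashJoin()
    build_iter, probe_iter = iter(build_batches), iter(probe_batches)
    out = []
    for side in schedule:
        if side == "build":
            join.build(next(build_iter))
        else:
            out.extend(join.probe(next(probe_iter)))
    return out


build_batches = [[(1, "a")], [(2, "b")]]
probe_batches = [[(1, "x"), (2, "y")]]

# Normal execution: the whole build side arrives before any probe batch.
correct = run(["build", "build", "probe"], build_batches, probe_batches)

# After recovery: the probe replay overtakes the second build batch.
raced = run(["build", "probe", "build"], build_batches, probe_batches)

print(correct)
print(raced)
```

In the raced schedule the row with key 2 is silently dropped, because its build-side match arrives only after the probe batch has already been processed. Nothing fails loudly, which is what makes the bug subtle.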

marsupialtail commented 1 year ago

Fixed in ee40367c51a122728b6cb5823fbb5664d576c014