Currently the correctness of the staged execution hinges on:
The global stage counter in Redis increments past X only after all the actors in stage X have pushed data to their targets
Arrow Flight server serves up data corresponding to stage X-1 before serving up data corresponding to stage X to whoever asks.
This works just fine in normal execution. When there is a failure, there could be an intricate scenario as follows. Consider a left deep join tree, where all intermediates and probe input have stage 0 and build inputs have stage -1. Consider the join node at the top, which has stage 0. Upon normal execution all build inputs have finished and thus stage counter has incremented to 0. Now the join node on the top has not yet executed anything.
Now if it is to execute, the Arrow Flight server will preferentially serve up the build side first. Great.
However if the machine it is one dies, and it is resurrected on another machine, it will ask for any inputs from the build and the intermediate node before it, which could be replayed. Now the global stage counter is 0. If the probe side replayer finishes fast, and this node executes before the build side replayer is done, then we will have a problem.
Currently the correctness of the staged execution hinges on:
This works just fine in normal execution. When there is a failure, there could be an intricate scenario as follows. Consider a left deep join tree, where all intermediates and probe input have stage 0 and build inputs have stage -1. Consider the join node at the top, which has stage 0. Upon normal execution all build inputs have finished and thus stage counter has incremented to 0. Now the join node on the top has not yet executed anything.
Now if it is to execute, the Arrow Flight server will preferentially serve up the build side first. Great.
However if the machine it is one dies, and it is resurrected on another machine, it will ask for any inputs from the build and the intermediate node before it, which could be replayed. Now the global stage counter is 0. If the probe side replayer finishes fast, and this node executes before the build side replayer is done, then we will have a problem.