nsoft / jesterj

Document Ingestion Framework for Search Systems
Apache License 2.0

Restarting a plan with non-deterministic routers can lead to errors #194

Closed nsoft closed 1 year ago

nsoft commented 1 year ago

We don't currently account for non-deterministic routers properly. The following scenario is possible:

  1. Plan is running and sending documents through a round robin router with N downstream destinations
  2. Plan is halted, killed, or the machine crashes after document D has passed the round robin router but before it exits the final downstream output destination O1.
  3. Plan is restarted and scans for stranded documents
  4. D is detected as stranded, with O1 as the pending destination, thus only O1 is listed on the document as an intended destination.
  5. D arrives at the round robin router which has O1, O2 and O3 as potential destinations
  6. The round robin router picks O2 and removes any destinations that are not downstream of the next step that leads to O2, thereby removing O1 as a destination from D

This creates an invalid Document that has no destinations, and that will error out when it reaches the final downstream step.
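
To make the failure concrete, here is a toy model of steps 4 through 6. It is only a sketch: the class, the `route` method, and the branch names are hypothetical and are not JesterJ classes, and it assumes each router branch leads to exactly one output.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only; every name here is hypothetical, not JesterJ API. */
public class StrandedDocDemo {

    /** Assumed plan shape: each branch below the router reaches exactly one output. */
    static final Map<String, Set<String>> DOWNSTREAM = Map.of(
            "branch1", Set.of("O1"),
            "branch2", Set.of("O2"),
            "branch3", Set.of("O3"));

    /**
     * Mimics the pruning in step 6: keep only the destinations that are
     * reachable through the branch the router happened to choose.
     */
    static Set<String> route(Set<String> destinations, String chosenBranch) {
        Set<String> remaining = new TreeSet<>(destinations);
        remaining.retainAll(DOWNSTREAM.get(chosenBranch));
        return remaining;
    }

    public static void main(String[] args) {
        // Step 4: after restart, D is stranded with only O1 pending.
        Set<String> pending = Set.of("O1");
        // Step 6: the round robin picks the branch that leads to O2.
        System.out.println(route(pending, "branch2")); // prints []
        // Had it picked the branch that leads to O1, D would stay valid.
        System.out.println(route(pending, "branch1")); // prints [O1]
    }
}
```

The empty set in the first case corresponds to the invalid Document with no destinations.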

nsoft commented 1 year ago

The above can be solved fairly easily for the most common cases by ensuring that any non-deterministic router for which O1 is a downstream step adds all of its eligible downstream steps back onto the document in step 4. This is a manifestation of our design, in which any destination downstream of a non-deterministic router is considered equivalent.
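
A minimal sketch of that idea, assuming the plan can tell us which outputs sit below each non-deterministic router. The class, the `expandPending` method, and the map shape are hypothetical and are not the actual JesterJ implementation. It does a single pass, so it only covers the common cases.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only; names are hypothetical, not JesterJ API. */
public class RestoreDestinationsSketch {

    /**
     * For each non-deterministic router that has a pending destination
     * downstream of it, add back every output that router could have chosen.
     *
     * @param pending          destinations recorded on the stranded document
     * @param eligibleByRouter assumed lookup: router name to the outputs below it
     */
    static Set<String> expandPending(Set<String> pending,
                                     Map<String, Set<String>> eligibleByRouter) {
        Set<String> expanded = new TreeSet<>(pending);
        for (Set<String> eligible : eligibleByRouter.values()) {
            // A shared element means a pending destination is below this router.
            if (!Collections.disjoint(eligible, pending)) {
                expanded.addAll(eligible);
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> routers =
                Map.of("roundRobin", Set.of("O1", "O2", "O3"));
        // D was stranded with only O1 pending; O2 and O3 are restored.
        System.out.println(expandPending(Set.of("O1"), routers)); // prints [O1, O2, O3]
    }
}
```

With all three outputs back on the document, whichever branch the round robin picks still leaves D with a valid destination.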

I can imagine some scenarios where this is not a complete solution, for example fan-out to a heavy step early in the plan with rejoining paths followed by several outputs in series. In such a design our assumptions will cause at-least-once delivery for all but the final output destination step, even if at-most-once was the intended contract. That can be fixed as a follow-on issue.

nsoft commented 1 year ago

Actually, my simple solution fails in a few other cases too. Now that I'm into it, I also realize it's not that hard to fully solve this, so a follow-on ticket probably won't be needed.