nsoft / jesterj

Document Ingestion Framework for Search Systems
Apache License 2.0

Handle fault tolerance delivery guarantees for child documents #177

Open nsoft opened 1 year ago

nsoft commented 1 year ago

As it stands (and as it will be for 1.0), the fault tolerance implementation attempts to ensure at-most-once delivery in a manner that approaches exactly-once delivery. Failure to deliver should be limited to cases where the system suffers a power-cord-style failure between recording the send and actually sending the document.
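To make the ordering concrete, here is a minimal sketch of record-then-send. It is an illustration only, not JesterJ's actual implementation: the class `AtMostOnceSender`, the in-memory set standing in for a durable status store, and the `transport` callback are all hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical illustration of record-then-send; not JesterJ's real code. */
class AtMostOnceSender {
    // Stand-in for a durable status store (assumption: a real system persists this).
    private final Set<String> recordedAsSent = new HashSet<>();
    // Stand-in for whatever actually ships the document to the search system.
    private final Consumer<String> transport;

    AtMostOnceSender(Consumer<String> transport) {
        this.transport = transport;
    }

    /** Returns true if a send was attempted, false if the id was already recorded. */
    boolean deliver(String docId) {
        // Record first. If the id was already recorded, never send again.
        if (!recordedAsSent.add(docId)) {
            return false;
        }
        // A crash between the record above and the send below loses the document,
        // but the document can never be delivered twice.
        transport.accept(docId);
        return true;
    }
}
```

Recording before sending is what trades a small chance of loss for the guarantee of no duplicates. Reversing the two steps would give at-least-once behavior instead.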

In its current form, however, it does not properly account for child document creation. When a Processor generates child documents (any case where the processor returns more than one document), the system may re-attempt all of the children if any one of them has failed.

One possible workaround would be to write the children to disk (Java serialization) and then re-read them with a file scanner. The proper fix will be to identify child-producing processors and treat them as if they were scanners (and to migrate terminology to 'document source' vs 'scanner').
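The workaround above could look roughly like the following sketch. Everything in it is an assumption made for illustration: `ChildSpool`, `ChildDoc`, and the `.ser` file naming are hypothetical and not part of JesterJ, and child ids are assumed to be safe to use as file names. The `readAll` method only stands in for what a file scanner pointed at the spool directory would do.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical spool for child documents; not a JesterJ class. */
class ChildSpool {

    /** Hypothetical stand-in for a child document. */
    static class ChildDoc implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final String body;

        ChildDoc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    /** Writes each child to its own file so each can be tracked independently later. */
    static void spool(Path dir, List<ChildDoc> children) throws IOException {
        Files.createDirectories(dir);
        for (ChildDoc child : children) {
            // Assumption: the id is filesystem-safe and unique within the directory.
            Path target = dir.resolve(child.id + ".ser");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(target))) {
                out.writeObject(child);
            }
        }
    }

    /** Reads the spooled children back in file-name order, as a scanner pass might. */
    static List<ChildDoc> readAll(Path dir) throws IOException, ClassNotFoundException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing
                    .filter(p -> p.getFileName().toString().endsWith(".ser"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<ChildDoc> result = new ArrayList<>();
        for (Path file : files) {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                result.add((ChildDoc) in.readObject());
            }
        }
        return result;
    }
}
```

Because each child lands in its own file, a scanner that picks the files up sees every child as an independent document, so a failure of one child would not need to trigger re-delivery of its siblings.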

Calling this a bug, but it will be a known issue for 1.0 since many use cases don't involve producing child documents and the system should be perfectly functional for any other use case.