nsoft / jesterj

Document Ingestion Framework for Search Systems
Apache License 2.0
35 stars, 33 forks

Scanner Deletion Detection #175

Open nsoft opened 1 year ago

nsoft commented 1 year ago

At the moment, none of our scanners can detect that a previously indexed document has disappeared. IIRC, the old version of File Scanner that was based on directory watches did have this ability, but it had to be scrapped due to issues with the JDK implementation (see #130).

Initial thoughts:

  1. Build up a data structure to hold the list of IDs seen during a scan (pick an efficient one).
  2. At the end of the scan, identify any previously known IDs that were not seen during the scan. For each one, if the last status for that doc is not "delete" or "error", send a delete.
  3. Serialize and persist this structure at the end of each scan.
  4. When starting up a scanner, check for the serialized structure and load it if present.
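
The four steps above could be sketched roughly as follows. This is only an illustration, not JesterJ code: `SeenIdTracker`, its `Status` enum, and the one-ID-per-line file format are all hypothetical, and it assumes IDs never contain line breaks and that the scanner can supply the last known status for each ID.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of JesterJ): tracks document IDs across scans. */
class SeenIdTracker {
    /** Last known status of a document; these names are assumptions. */
    enum Status { INDEXED, DELETE, ERROR }

    private final Set<String> previous = new HashSet<>(); // IDs seen in the last completed scan
    private final Set<String> current = new HashSet<>();  // IDs seen so far in the running scan

    /** Step 1: record every ID the scanner encounters. */
    void recordSeen(String id) {
        current.add(id);
    }

    /**
     * Step 2: return the IDs that were known before but not seen this time,
     * skipping any whose last status is already DELETE or ERROR.
     * Afterwards the current scan becomes the new baseline.
     */
    Set<String> finishScan(Map<String, Status> lastStatus) {
        Set<String> toDelete = new TreeSet<>();
        for (String id : previous) {
            if (current.contains(id)) {
                continue; // still present, nothing to do
            }
            Status s = lastStatus.get(id);
            if (s != Status.DELETE && s != Status.ERROR) {
                toDelete.add(id);
            }
        }
        previous.clear();
        previous.addAll(current);
        current.clear();
        return toDelete;
    }

    /** Step 3: persist the baseline, one ID per line (assumed format). */
    void save(Path file) throws IOException {
        Files.write(file, new TreeSet<>(previous), StandardCharsets.UTF_8);
    }

    /** Step 4: load the baseline on startup if a file exists. */
    static SeenIdTracker load(Path file) throws IOException {
        SeenIdTracker tracker = new SeenIdTracker();
        if (Files.exists(file)) {
            tracker.previous.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return tracker;
    }
}
```

A plain `HashSet` is used here for simplicity; for very large corpora a more compact structure would probably be needed, which is what "pick an efficient one" hints at.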

Also, make sure processor-related documentation/javadocs clearly mention the possibility that the document may represent a deletion (often processors will want to ignore these documents), and make sure our provided implementations have an option to ignore deletes (or not).
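
One possible shape for such an option, as a sketch only: the `Doc` and `DeleteAwareProcessor` types below are hypothetical stand-ins rather than JesterJ's actual document or processor API, and "ignore" is interpreted here as "pass the deletion through untouched" so that it can still reach the index.

```java
import java.util.function.Function;

/** Hypothetical stand-in for a document flowing through a pipeline. */
final class Doc {
    final String id;
    final boolean deletion; // true if this document represents a delete

    Doc(String id, boolean deletion) {
        this.id = id;
        this.deletion = deletion;
    }
}

/** Hypothetical wrapper that optionally skips processing for deletions. */
final class DeleteAwareProcessor {
    private final boolean ignoreDeletes;
    private final Function<Doc, Doc> delegate;

    DeleteAwareProcessor(boolean ignoreDeletes, Function<Doc, Doc> delegate) {
        this.ignoreDeletes = ignoreDeletes;
        this.delegate = delegate;
    }

    Doc process(Doc doc) {
        if (doc.deletion && ignoreDeletes) {
            return doc; // leave deletions unmodified
        }
        return delegate.apply(doc);
    }
}
```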