stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Searcher.search intermittent bug after IndexUpdater.persist_to_disk #261

Open jessiejuachon opened 9 months ago

jessiejuachon commented 9 months ago

Searcher.search produces different results after IndexUpdater.persist_to_disk

  1. Call Searcher.search on an index and note the returned passage ids.
  2. Call IndexUpdater.add, IndexUpdater.remove and IndexUpdater.persist_to_disk several times for the same set of passages (the content is unchanged; old pids are removed and new pids are generated for the same passages). Confirm that the index files have been updated.
  3. Call Searcher.search with the same query as in step 1. It correctly returns the most recently inserted pid from step 2.
  4. Restart the application (or re-initialize the Searcher and IndexUpdater, passing the same path to the now-updated index).
  5. Call Searcher.search with the same query as in step 1. It correctly returns the same result as in step 3.
  6. Call IndexUpdater.add once for the same set of passages as in step 2.
  7. Call Searcher.search with the same query as in steps 1, 3 and 5.
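
The steps above can be sketched as a small harness. This is only an illustration, not code from the ColBERT repository: `run_repro`, `top_pids`, `make_searcher` and `make_updater` are hypothetical names, the caller supplies the factories that build the real Searcher and IndexUpdater, and the sketch assumes that `search(query, k=...)` returns a `(pids, ranks, scores)` tuple and that `add` returns the newly assigned pids.

```python
from typing import Callable, Dict, List, Sequence


def top_pids(searcher, query: str, k: int = 5) -> List[int]:
    """Return the passage ids from a single search call.

    Assumption: search(query, k=...) returns a (pids, ranks, scores) tuple.
    """
    pids, _ranks, _scores = searcher.search(query, k=k)
    return list(pids)


def run_repro(
    make_searcher: Callable[[], object],
    make_updater: Callable[[object], object],
    query: str,
    passages: Sequence[str],
    pids: Sequence[int],
    rounds: int = 3,
    k: int = 5,
) -> Dict[str, List[int]]:
    """Walk through steps 1-7 and return the pids seen at steps 1, 3, 5 and 7.

    make_searcher and make_updater are hypothetical factories supplied by the
    caller; they should open the same on-disk index every time they are called.
    """
    observed: Dict[str, List[int]] = {}

    searcher = make_searcher()
    updater = make_updater(searcher)
    observed["step1"] = top_pids(searcher, query, k)  # step 1

    current = list(pids)
    for _ in range(rounds):  # step 2: same passages, fresh pids each round
        updater.remove(current)
        current = list(updater.add(list(passages)))  # assumption: add returns new pids
        updater.persist_to_disk()
    observed["step3"] = top_pids(searcher, query, k)  # step 3

    searcher = make_searcher()  # step 4: re-initialize against the updated index
    updater = make_updater(searcher)
    observed["step5"] = top_pids(searcher, query, k)  # step 5

    updater.add(list(passages))  # step 6: a single add, no persist_to_disk
    # Step 7: the search that follows the post-restart add. Any exception
    # raised here propagates to the caller unchanged.
    observed["step7"] = top_pids(searcher, query, k)
    return observed
```

Skipping the re-initialization in step 4 (reusing the first `searcher` and `updater`) gives the variant that is described in the note below as not failing.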

ERROR/S:

Note: If step 4 is skipped, then there is no error.

Test environment: CPU

TakshPanchal commented 3 months ago

I think the issue is that IndexUpdater.persist_to_disk updates only the embedding vectors. ColBERT's index folder has a collection.json file in which all the docs are saved. IndexUpdater.persist_to_disk should also update that collection, and after the update the searcher should be refreshed with the latest collection.
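
One way to probe this hypothesis is to compare the pids returned by Searcher.search with the size of the collection the searcher has loaded. The helper below is a minimal sketch under stated assumptions: `pids_outside_collection` is a hypothetical name (not part of ColBERT), pids are taken to be 0-based positions into the collection, and the collection is assumed to support `len()`.

```python
from typing import Iterable, List, Sized


def pids_outside_collection(pids: Iterable[int], collection: Sized) -> List[int]:
    """Return the pids that have no matching position in `collection`.

    Assumption: a pid is a 0-based position into the collection, so any pid
    outside the range [0, len(collection)) cannot be resolved to a passage.
    """
    size = len(collection)
    return [pid for pid in pids if not 0 <= pid < size]
```

If pids created by IndexUpdater.add show up in this list after a restart, that would be consistent with the collection not having been updated alongside the embeddings; an empty list would point elsewhere.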

pydv9991 commented 2 months ago

Any update on this issue?