quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Does tantivy::IndexWriter support multi-process? #2408

Closed cyccbxhl closed 1 month ago

cyccbxhl commented 1 month ago

I'm considering an integration of Tantivy with PostgreSQL, which has a multi-process architecture where each insert/update/delete/vacuum operation is handled by a separate process. However, Tantivy's IndexWriter cannot be initialized simultaneously by multiple processes in the same directory due to the INDEX_WRITER_LOCK.

To enable concurrent insert/delete operations, I'm thinking of removing the INDEX_WRITER_LOCK when initializing the tantivy::IndexWriter within the PostgreSQL insert backend process. Instead, I plan to introduce a read-write lock located in shared memory. This lock would be acquired as a read lock during add_document/delete_term operations and as a write lock during commit operations (the garbage collection process would also acquire a write lock).

Would this approach violate any internal implementation contracts of Tantivy, and is it feasible?

I'm looking for guidance on whether this strategy could potentially disrupt Tantivy's internal mechanisms and whether it aligns with Tantivy's design principles.
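
The locking scheme proposed above can be sketched roughly as follows. This is only an in-process illustration under loud assumptions: threads stand in for PostgreSQL backend processes, `std::sync::RwLock` stands in for the lock in shared memory, and `BackendWriter`, `add_document`, `commit` and `run_backends` are hypothetical names that model the idea rather than Tantivy's API.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical per-backend writer. The RwLock guards the committed state
/// and stands in for the read-write lock placed in shared memory.
struct BackendWriter {
    gate: Arc<RwLock<Vec<String>>>,
    buffer: Vec<String>, // uncommitted operations, private to this backend
}

impl BackendWriter {
    fn new(gate: Arc<RwLock<Vec<String>>>) -> Self {
        BackendWriter { gate, buffer: Vec::new() }
    }

    /// add/delete operations hold the lock in shared (read) mode,
    /// so several backends can buffer work at the same time.
    fn add_document(&mut self, doc: &str) {
        let _shared = self.gate.read().unwrap();
        self.buffer.push(doc.to_string());
    }

    /// commit holds the lock in exclusive (write) mode,
    /// so no add/delete runs while a commit is publishing.
    fn commit(&mut self) {
        let mut committed = self.gate.write().unwrap();
        committed.append(&mut self.buffer);
    }
}

/// Runs `n` simulated backends and returns how many documents were committed.
fn run_backends(n: usize, docs_per_backend: usize) -> usize {
    let gate = Arc::new(RwLock::new(Vec::new()));
    let handles: Vec<_> = (0..n)
        .map(|b| {
            let gate = Arc::clone(&gate);
            thread::spawn(move || {
                let mut writer = BackendWriter::new(gate);
                for d in 0..docs_per_backend {
                    writer.add_document(&format!("backend{}-doc{}", b, d));
                }
                writer.commit();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = gate.read().unwrap().len();
    total
}
```

Calling `run_backends(4, 10)` exercises four concurrent writers. The sketch only shows the lock discipline; it says nothing about whether Tantivy's internals tolerate several writers on one directory.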

fulmicoton commented 1 month ago

> Would this approach violate any internal implementation contracts of Tantivy, and is it feasible?

Yes, it will most likely not work.

The problem has to do with the delete and commit work. Deletes are performed right before serialization.

The reason something like

 - add_doc(1)
 - delete_doc(1)
 - delete_doc(2)
 - add_doc(2)

works the way you expect is that we attach an opstamp to each document and each delete operation, so we know the order in which those operations happened.
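
To make the opstamp idea concrete, here is a toy model of it. It is a sketch under assumptions, not Tantivy's actual code: `ToyWriter`, `Op`, `push` and `commit` are hypothetical names, and the rule it encodes is simply that a delete only removes documents stamped before it.

```rust
/// Toy operations; the u64 is a document id.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add(u64),
    Delete(u64),
}

/// Hypothetical single writer that stamps each operation as it arrives.
#[derive(Default)]
struct ToyWriter {
    next_opstamp: u64,
    log: Vec<(u64, Op)>, // (opstamp, operation)
}

impl ToyWriter {
    /// Records an operation and returns the opstamp assigned to it.
    fn push(&mut self, op: Op) -> u64 {
        let stamp = self.next_opstamp;
        self.next_opstamp += 1;
        self.log.push((stamp, op));
        stamp
    }

    /// Resolves deletes at commit time: a document survives unless a
    /// delete for the same id carries a larger opstamp than the add.
    fn commit(&self) -> Vec<u64> {
        let mut live = Vec::new();
        for &(add_stamp, op) in &self.log {
            if let Op::Add(id) = op {
                let deleted = self.log.iter().any(|&(del_stamp, other)| {
                    other == Op::Delete(id) && del_stamp > add_stamp
                });
                if !deleted {
                    live.push(id);
                }
            }
        }
        live
    }
}
```

Pushing the four operations listed above in that order leaves only document 2 alive: the delete of 1 is stamped after the add of 1, while the delete of 2 is stamped before the add of 2.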

With your scheme, two concurrent writes could end up with very different outcomes.

 - add_doc(1)
 - add_doc(2)
 - delete_doc(1)
 - delete_doc(2)

You could end up with no docs, only doc1, only doc2, or both doc1 and doc2 in the resulting tantivy index.

It will NOT look like the transactions were executed in the order in which they took the write lock.

cyccbxhl commented 1 month ago

> The problem has to do with the delete and commit work. Deletes are performed right before serialization.

Can you explain more about it?

Tantivy's commit would be called from PostgreSQL's commit command. I don't actually need the Tantivy operations to be executed in the order in which they took the write lock. I need them to meet the Read Committed (RC) transaction isolation level: only committed operations and data are visible to other concurrent transactions. The behavior I expect:

(1)

[image]

Because doc1 and doc2 are invisible to the delete ops in txn2, once txn1 and txn2 have both committed, the resulting tantivy index contains doc1 and doc2.

(2)

[image]

Once txn1 and txn2 have both committed, the resulting tantivy index contains no docs.

Is Tantivy able to do that?
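
The two expected behaviors can be modeled with a small sketch. Since the original diagrams are not available, the schedules are assumptions: in scenario (1) txn2 issues its deletes before txn1 commits, and in scenario (2) txn2 issues them after txn1 has committed. `Store`, `Txn` and the scenario functions are hypothetical names. The model is deliberately simplified (a transaction cannot delete its own uncommitted documents) and is not Tantivy or PostgreSQL code.

```rust
use std::collections::BTreeSet;

/// Committed documents, visible to every transaction.
#[derive(Default)]
struct Store {
    committed: BTreeSet<u64>,
}

/// A transaction with its own uncommitted adds and deletes.
#[derive(Default)]
struct Txn {
    adds: Vec<u64>,
    deletes: Vec<u64>,
}

impl Txn {
    fn add_doc(&mut self, id: u64) {
        self.adds.push(id);
    }

    /// Read Committed: the delete only targets documents that are
    /// already committed at the moment the statement runs.
    fn delete_doc(&mut self, store: &Store, id: u64) {
        if store.committed.contains(&id) {
            self.deletes.push(id);
        }
    }

    /// Publishes this transaction's deletes, then its adds.
    fn commit(self, store: &mut Store) {
        for id in self.deletes {
            store.committed.remove(&id);
        }
        for id in self.adds {
            store.committed.insert(id);
        }
    }
}

/// (1) txn2 deletes while txn1's adds are still uncommitted.
fn scenario_one() -> Vec<u64> {
    let mut store = Store::default();
    let mut txn1 = Txn::default();
    let mut txn2 = Txn::default();
    txn1.add_doc(1);
    txn1.add_doc(2);
    txn2.delete_doc(&store, 1); // doc1 is not visible yet
    txn2.delete_doc(&store, 2); // doc2 is not visible yet
    txn1.commit(&mut store);
    txn2.commit(&mut store);
    store.committed.into_iter().collect()
}

/// (2) txn2 deletes after txn1 has committed.
fn scenario_two() -> Vec<u64> {
    let mut store = Store::default();
    let mut txn1 = Txn::default();
    txn1.add_doc(1);
    txn1.add_doc(2);
    txn1.commit(&mut store);
    let mut txn2 = Txn::default();
    txn2.delete_doc(&store, 1);
    txn2.delete_doc(&store, 2);
    txn2.commit(&mut store);
    store.committed.into_iter().collect()
}
```

In this model `scenario_one()` ends with doc1 and doc2 in the store and `scenario_two()` ends with an empty store, matching the behavior described in (1) and (2).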

fulmicoton commented 1 month ago

Actually, I think you are right. It might work.

cyccbxhl commented 1 month ago

Thank you very much. I'll try implementing the code based on this plan first to see if there are any other issues. There might be some questions I will need to ask you later. I will close this issue for now.