quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Allow single-threaded addition of documents to existing index without spawning new threads #1753

Open GeeWee opened 1 year ago

GeeWee commented 1 year ago

Is your feature request related to a problem? Please describe.

Continuing my work on getting tantivy to work server-side with WASM (related issues #1751 and #541), I would like to index dynamically added documents. In essence, I have a large set of documents that I can pre-index in a build phase, but each user also has some documents that are loaded dynamically from a database.

I would imagine that normally I could do something like this:

let index = Index::open(ram_directory).unwrap();
let mut index_writer = index.writer(3_000_000).unwrap();
index_writer.add_document(my_document).unwrap();
index_writer.commit().unwrap();

And then search the index. I realize that this might leak documents from one tenant to another, but as this index is rebuilt in-memory on each request and dropped after, this isn't a large concern.

However, as WASM is single-threaded I'm unable to actually get this to work as it seems all the IndexWriters require a thread pool of some sort.

I have tried both index.writer() and index.writer_with_num_threads with both 1 and 0 threads. I've even delved into the undocumented SingleSegmentIndexWriter. Even though it seems to be intended only for creating an index with a single segment, and not for adding to an existing index, I figured I would give it a go.

However, trying to instantiate it gives me the following IoError:

IoError(
    Error {
        kind: Unsupported,
        message: "operation not supported on this platform",
    },
)

I think this might be threadpool-related, but I am unable to get a stack trace to confirm.
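One way to test the thread hypothesis without a stack trace is to probe thread spawning directly on the target. The sketch below uses only the Rust standard library; the function names are hypothetical and nothing here is part of tantivy. As far as I know, `std::thread::Builder::spawn` returns an `io::Error` of kind `Unsupported` on targets without thread support (for example `wasm32-unknown-unknown`), which would match the error shown above. Other unsupported operations can produce the same error kind, so a failing probe is consistent with the hypothesis but does not prove it.

```rust
use std::io;
use std::thread;

/// Tries to spawn and join a trivial thread.
/// Returns the spawn error instead of panicking, so the caller can inspect it.
fn probe_thread_support() -> io::Result<()> {
    let handle = thread::Builder::new()
        .name("thread-probe".to_string())
        .spawn(|| ())?;
    // The closure does nothing, so a join failure would be unexpected.
    handle.join().expect("probe thread panicked");
    Ok(())
}

/// Summarizes the probe result as text, including the error kind on failure.
fn describe_thread_support() -> String {
    match probe_thread_support() {
        Ok(()) => "threads available".to_string(),
        Err(err) => format!("cannot spawn threads: {:?}: {}", err.kind(), err),
    }
}
```

Calling `describe_thread_support()` early in the WASM entry point and logging the result would show whether thread creation fails on that platform before any tantivy code runs.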

Describe the solution you'd like

I think in essence I'm asking if there's any way to accomplish what I want to do.

ppodolsky commented 1 year ago

Tantivy already has a single-threaded index writer, dunno the exact name. Check https://github.com/izihawa/summa, which already implements WASM bindings for Tantivy.

GeeWee commented 1 year ago

Hmm, Summa seems to use SingleSegmentIndexWriter, which for some reason doesn't seem to work for my use case.

ppodolsky commented 1 year ago

Sorry, I missed your point about SSIW in the first post. It will be hard without a stack trace, but the checklist is:

GeeWee commented 1 year ago

Thanks for your thoughts! I was unable to procure a stacktrace, but after fetching down tantivy and adding breakpoints everywhere I've managed to figure out the SSIW problem.

My problem was that my IndexSettings had docstore_compress_dedicated_thread=true (as is the default) and I had not realized that. After changing that to false, hooray - it works!

Now for the next issue: SSIW doesn't allow adding documents to an existing index, as it overwrites the meta properties of the index so that they contain only the segment it writes. This means that adding documents and calling finalize will overwrite any other segments in the index.

However, if I create my own commit method inside SSIW that looks like the below snippet, then it seems to work and add documents successfully without overwriting existing documents.

pub fn commit(self) -> crate::Result<Index> {
    let max_doc = self.segment_writer.max_doc();
    self.segment_writer.finalize()?;

    let segment: Segment = self.segment.with_max_doc(max_doc);
    let index = segment.index();

    // Start from the segments the index already lists, instead of an empty list.
    let mut segments = index.searchable_segment_metas()?;

    // Append the newly written segment.
    segments.push(segment.meta().clone());

    let index_meta = IndexMeta {
        index_settings: index.settings().clone(),
        segments,
        schema: index.schema(),
        opstamp: 0,
        payload: None,
    };

    // Persist the combined segment list.
    save_metas(&index_meta, index.directory())?;
    index.directory().sync_directory()?;
    Ok(segment.index().clone())
}

It is essentially the same as the finalize method, except that it carries over the segment metas that already exist in the index. If you would accept this method in a PR, I would be very happy to provide one, but I'm still new to the internals of tantivy, so I'm not sure I'm "doing it right".
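The difference between the two behaviors can be illustrated with a toy model that uses only the standard library. Everything below is hypothetical: `ToySegmentMeta` and `ToyIndexMeta` are simplified stand-ins, not tantivy's `SegmentMeta` or `IndexMeta`, and the model ignores deletes, opstamps, and on-disk persistence. It only shows why replacing the segment list hides earlier documents while appending to it keeps them.

```rust
/// Hypothetical stand-in for a segment's metadata (not tantivy's `SegmentMeta`).
#[derive(Clone, Debug, PartialEq)]
struct ToySegmentMeta {
    id: u32,
    max_doc: u32,
}

/// Hypothetical stand-in for the index metadata (not tantivy's `IndexMeta`).
#[derive(Clone, Debug, Default, PartialEq)]
struct ToyIndexMeta {
    segments: Vec<ToySegmentMeta>,
}

/// Models the behavior described for `finalize`: the new segment
/// replaces whatever segment list the index had before.
fn finalize_overwriting(_existing: &ToyIndexMeta, new_segment: ToySegmentMeta) -> ToyIndexMeta {
    ToyIndexMeta {
        segments: vec![new_segment],
    }
}

/// Models the proposed `commit`: keep the existing segments and
/// append the new one.
fn commit_appending(existing: &ToyIndexMeta, new_segment: ToySegmentMeta) -> ToyIndexMeta {
    let mut segments = existing.segments.clone();
    segments.push(new_segment);
    ToyIndexMeta { segments }
}

/// Number of documents visible through the metadata (deletes are ignored here).
fn total_docs(meta: &ToyIndexMeta) -> u32 {
    meta.segments.iter().map(|segment| segment.max_doc).sum()
}
```

With a pre-built segment of 100 documents and a new segment of 3, `total_docs` over the result of `finalize_overwriting` counts only the 3 new documents, while over the result of `commit_appending` it counts all 103.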