quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Near Real Time indexing. #494

Open fulmicoton opened 5 years ago

fulmicoton commented 5 years ago

Use case

A lot of users or potential users of tantivy want to use it in client applications (e.g. indexing the web pages they see).

They do not really care much about ingestion throughput, as they handle live, human-generated data, but they do care about finding their data quickly after ingesting it.

Calling .commit() after every single inserted document is not much of a problem in itself, but the creation of a new segment and the associated files might be a bit too aggressive.

Proposed solution

We introduce the notion of a soft commit. Soft commits act similarly to commits, except that the meta.json file is not written. Soft commits are implemented as another SegmentRegister in the SegmentManager.

We then introduce a .persist() method on directories as well. The role of this method is to ensure that the data is persisted, blocking until this is done (returning a future is acceptable too).

.prepare_commit(bool) would then take an extra boolean that says whether this is a soft commit or not. On a non-soft prepare_commit, the directory's .persist() method is called. On commit(), meta.json is written.
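The bookkeeping above can be sketched in a few lines of plain Rust. This is a simplified model of the proposal, not tantivy's actual API: `SegmentManager`, `prepare_commit`, and the two registers are stand-ins mirroring the names in this issue, and the meta.json write is modeled as a counter.

```rust
/// Sketch of soft vs. hard commit bookkeeping, assuming a simplified
/// SegmentManager with a separate register for soft-committed segments.
/// Names mirror the proposal but are hypothetical, not tantivy's API.
#[derive(Default)]
struct SegmentManager {
    committed: Vec<String>,      // segments recorded in meta.json
    soft_committed: Vec<String>, // searchable, but meta.json not yet written
    meta_json_writes: usize,     // counts durable metadata writes
}

impl SegmentManager {
    /// prepare_commit(soft): a soft commit only moves the segment into the
    /// soft register; a hard commit would also call the directory's
    /// persist() before meta.json is written.
    fn prepare_commit(&mut self, segment: &str, soft: bool) {
        if soft {
            self.soft_committed.push(segment.to_string());
        } else {
            // (directory.persist() would happen here)
            self.committed.push(segment.to_string());
            self.meta_json_writes += 1; // commit(): meta.json written
        }
    }

    /// Segments visible to searchers: both registers.
    fn searchable(&self) -> usize {
        self.committed.len() + self.soft_committed.len()
    }
}

fn main() {
    let mut mgr = SegmentManager::default();
    mgr.prepare_commit("seg0", true);
    mgr.prepare_commit("seg1", true);
    // Soft commits make segments searchable without touching meta.json:
    assert_eq!(mgr.searchable(), 2);
    assert_eq!(mgr.meta_json_writes, 0);
    mgr.prepare_commit("seg2", false);
    // A hard commit records the segment durably and writes meta.json:
    assert_eq!(mgr.meta_json_writes, 1);
    println!("searchable segments: {}", mgr.searchable());
}
```

The point of the second register is that a crash only loses soft-committed segments; everything reachable from meta.json stays consistent.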

We also need to introduce an NRTDirectory that consists of a RAMDirectory masking another directory (in most cases an MmapDirectory). All writes (except committed segment merges?) go to the RAMDirectory.

On .persist(), this in-RAM data gets written to the MmapDirectory.

In the future, we will probably want to start automatically writing files to disk when they exceed a given size.
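The masking behaviour can be sketched with two in-memory maps standing in for the two layers. This is a toy model, not tantivy code: `NrtDirectory` is hypothetical, and `HashMap<String, Vec<u8>>` stands in for both the RAM layer and the on-disk directory.

```rust
use std::collections::HashMap;

/// Minimal sketch of the proposed NRT layering: a volatile layer masks a
/// persistent one. `NrtDirectory` is a hypothetical name; the two maps
/// stand in for a RAMDirectory and an MmapDirectory respectively.
struct NrtDirectory {
    ram: HashMap<String, Vec<u8>>,        // fast, volatile layer
    persistent: HashMap<String, Vec<u8>>, // stands in for the on-disk directory
}

impl NrtDirectory {
    fn new() -> Self {
        NrtDirectory { ram: HashMap::new(), persistent: HashMap::new() }
    }

    /// All writes go to the RAM layer first.
    fn write(&mut self, path: &str, data: &[u8]) {
        self.ram.insert(path.to_string(), data.to_vec());
    }

    /// Reads check the RAM layer before falling back to the persistent one,
    /// so freshly written segment files are readable before any fsync.
    fn read(&self, path: &str) -> Option<&Vec<u8>> {
        self.ram.get(path).or_else(|| self.persistent.get(path))
    }

    /// On .persist(), the in-RAM data is written down to the persistent
    /// directory and the RAM layer is drained.
    fn persist(&mut self) {
        for (path, data) in self.ram.drain() {
            self.persistent.insert(path, data);
        }
    }
}

fn main() {
    let mut dir = NrtDirectory::new();
    dir.write("seg0.idx", b"postings");
    // Visible immediately, before any durable write:
    assert_eq!(dir.read("seg0.idx").map(|d| d.as_slice()), Some(&b"postings"[..]));
    assert!(dir.persistent.is_empty());
    dir.persist();
    // After persist(), the data lives only in the durable layer:
    assert!(dir.ram.is_empty());
    assert_eq!(dir.read("seg0.idx").map(|d| d.as_slice()), Some(&b"postings"[..]));
    println!("persisted {} file(s)", dir.persistent.len());
}
```

Reads stay uniform across both layers, which is what lets a searcher open soft-committed segments without knowing whether they have been persisted yet.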

petr-tik commented 5 years ago

Hey @diggsey, IIRC, you want to make newly indexed documents searchable immediately.

Does this ticket capture and address the use case you were asking me about at the Rust London meetup after my presentation about tantivy?

If so, can you please share more detail and numbers of your use case?

Constraints and use cases?

Your current infrastructure takes too much time to make new documents available and I was wondering:

- What is the average and variance in those times? High std dev or low std dev? How predictable is the performance from load, peak times or client?
- How big are your corpora of documents? From 1GB to 100s of GB? How many corpora? What is the potential scale of your product – bigger corpora, more corpora of the same size, or both?
- Are you read-heavy, write-heavy or roughly similar? How much does it change with time of day/year?
- Do you run on-prem or in public cloud?
- How complex are the queries that users run?
- Outside the 1ms delay, what are your other pain points with your current indexing and search infrastructure?

Overall setup:

Each client loads, indexes and searches only their own documents.

You partition different clients’ documents (I am guessing that ES chooses a partition by client_id as key?) to separate instances.

Appreciate you are mostly happy with your current search infrastructure. From your description, it sounds like the kind of problem tantivy (and Rust) can solve well in the future, if we direct our efforts towards it now, so every little helps(tm).

Diggsey commented 5 years ago

Does this ticket capture and address the use case you were asking me about at the Rust London meetup after my presentation about tantivy?

I believe so, although being unfamiliar with the implementation details of tantivy, some of the description is difficult to follow.

Your current infrastructure takes too much time to make new documents available and I was wondering: What is the average and variance in those times? High std dev or a low std dev? How predictable is the performance from load, peak times or client?

We use elasticsearch, and it's configured with a fixed "refresh interval" - the shortest interval you can set is 1 second. This means that if we need to create and index a series of documents, where each document depends on the previous one, we can only ever index 1 document per second.

How big are your corpora of documents? From 1GB to 100s of GB? How many corpora? What is the potential scale of your product – bigger corpora or more corpora of the same size, or both?

Document sizes vary quite a lot, but maybe a few KB would be the average? We have ~3 million documents, but we also have nested sub-documents.

Are you read-heavy, write-heavy or roughly similar? How much does it change with time of day/year?

Roughly similar. Documents are not immutable and are very frequently updated immediately after creation. Rate of updates for a document decreases with time, but reads will stay fairly consistent.

Do you run on-prem or in public cloud?

Public cloud.

How complex are the queries that users run?

Fairly complicated? We condition on nested documents and things like that, and we query in a few different ways. (For example, some users have permissions to view only some documents, and we push some of the permissions logic down into the search query so that pagination can still work somewhat accurately).

Outside the 1ms delay, what are your other pain points with current indexing and search infrastructure?

1ms would be fine, it's the 1-second delay that is the problem 😛 . Other than that the only pain points are performance and memory usage.

Each client loads, indexes and searches only their own documents.

Correct. But a client, in this case, is a company with potentially many users.

You partition different clients’ documents (I am guessing that ES chooses a partition by client_id as key?) to separate instances.

We use a client ID for the ES routing parameter.

justmao945 commented 3 years ago

Hi my dear friends, any updates on this feature? Near-real-time reload seems useful for avoiding too many new segments and too-frequent merges.