eddyxu opened 1 year ago
Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers
Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066
With full-text search capability, we can retrieve document data more efficiently, and with BM25 we can rank the results to achieve better retrieval quality.
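For reference, this is the standard BM25 scoring function we'd be computing per query token (a minimal sketch of the textbook formula; the exact `k1`/`b` defaults and IDF variant in the eventual implementation may differ):

```python
import math

def bm25_score(tf: int, doc_len: int, avg_doc_len: float,
               num_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    # Score one (token, document) pair with the textbook BM25 formula.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

A document's score for a query is the sum of this value over the query's tokens.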
The index consists of 3 parts: `TokenSet`, `InvertedList`, and `DocSet`. We will store them in 3 individual files:

- `TokenSet` is basically a map from token (word) to its token_id and frequency.
- `InvertedList` records the row_id and frequency at which each token occurs, for fast retrieval.
- `DocSet` records the number of tokens in each row, which is used for BM25 scoring.

We divide the index structure into three files because it allows us to minimize IO:

- For new data, we can reuse the `TokenSet` and generate the `InvertedList` and `DocSet` for the new data.
- For searching, we need to read only the `TokenSet` and `InvertedList`.
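To make the layout concrete, here is a rough in-memory sketch of the three parts (hypothetical Python shapes; the actual on-disk encoding differs):

```python
from dataclasses import dataclass, field

@dataclass
class TokenSet:
    # token (word) -> (token_id, total frequency across the corpus)
    tokens: dict[str, tuple[int, int]] = field(default_factory=dict)

@dataclass
class InvertedList:
    # token_id -> postings of (row_id, frequency of the token in that row)
    postings: dict[int, list[tuple[int, int]]] = field(default_factory=dict)

@dataclass
class DocSet:
    # row_id -> number of tokens in that row (document length for BM25)
    token_counts: dict[int, int] = field(default_factory=dict)
```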
When the search is combined with a filter, the execution plan is: the filter generates a `RowIdMask` that represents the rows that should be included; the FTS node then checks each row against the `RowIdMask` and takes the k documents with the highest scores among the included rows.
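A minimal sketch of that plan, building on the `bm25_score` and index-structure sketches above and treating the `RowIdMask` as a plain set of allowed row ids (the real mask is a more compact structure):

```python
import heapq

def fts_with_prefilter(query_tokens, token_set, inverted_list, doc_set,
                       row_id_mask, k=10):
    # Hypothetical helper: score only rows allowed by the mask, keep the k best.
    num_docs = len(doc_set.token_counts)
    avg_len = sum(doc_set.token_counts.values()) / max(num_docs, 1)
    scores = {}
    for token in query_tokens:
        entry = token_set.tokens.get(token)
        if entry is None:
            continue  # token never seen at indexing time
        token_id, _ = entry
        postings = inverted_list.postings.get(token_id, [])
        for row_id, tf in postings:
            if row_id not in row_id_mask:  # predicate each row by the mask
                continue
            scores[row_id] = scores.get(row_id, 0.0) + bm25_score(
                tf, doc_set.token_counts[row_id], avg_len,
                num_docs, doc_freq=len(postings))
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```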
For now, we support doing either FTS or vector search, but not both together. In the future, we may add a rerank node to score the rows output by FTS & vector search to achieve higher retrieval quality.
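One possible shape for such a rerank node, using reciprocal rank fusion as a stand-in scorer (the actual fusion strategy is an open question):

```python
def rerank_rrf(fts_row_ids, vector_row_ids, k=10, c=60.0):
    # Reciprocal rank fusion: each ranked list contributes 1 / (c + rank) per
    # row, so rows ranked highly by both FTS and vector search rise to the top.
    scores = {}
    for ranking in (fts_row_ids, vector_row_ids):
        for rank, row_id in enumerate(ranking):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```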
As described above, the index consists of three parts. We'd copy the `TokenSet` from the previously built index, then update it if needed. The `TokenSet` is essentially a dictionary of words, so it won't be costly to copy. The `InvertedList` would then be constructed against the new `TokenSet` during indexing, and so would the `DocSet`.
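Sketched with the structures above (`text.split()` stands in for whatever tokenizer the index actually uses):

```python
from collections import Counter

def update_index(prev_token_set, new_rows):
    # Copy the TokenSet from the previously built index, extend it with the
    # new rows, and build a fresh InvertedList and DocSet for those rows only.
    token_set = TokenSet(dict(prev_token_set.tokens))
    inverted = InvertedList()
    doc_set = DocSet()
    for row_id, text in new_rows:
        tokens = text.split()  # stand-in for the real tokenizer
        doc_set.token_counts[row_id] = len(tokens)
        for token, tf in Counter(tokens).items():
            token_id, corpus_freq = token_set.tokens.get(
                token, (len(token_set.tokens), 0))
            token_set.tokens[token] = (token_id, corpus_freq + tf)
            inverted.postings.setdefault(token_id, []).append((row_id, tf))
    return token_set, inverted, doc_set
```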
To get it working as soon as possible, I haven't integrated it into the filter expression; instead, I just added a new interface to execute the full-text search. We may remove this interface once we get the parser ready. Here is a Python example:
```python
import random
import string
import tempfile

import lance
import pyarrow as pa

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]
id_array = pa.array(ids, type=pa.int64())
# the inverted index supports large string array only
doc_array = pa.array(docs, type=pa.large_string())
table = pa.table({"id": id_array, "doc": doc_array})

temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)
```
Reduce index file size and improve the indexing performance.
Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantivy FTS into the Rust core and offer FTS through the js/python/rust bindings.
Because we need to work with a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be reused as-is.