lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Full text search (FTS) indices #1195

Open eddyxu opened 1 year ago

eddyxu commented 1 year ago

Sept 9:

Sept 2:

Aug 26: Reduce index file size and improve the indexing performance

Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantivy FTS into the Rust core and offer FTS through the JS/Python/Rust bindings.

Because we need to work on a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be reused as-is.

wjones127 commented 4 months ago

Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers

wjones127 commented 4 months ago

Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066

BubbleCal commented 2 months ago

What is this for?

With full text search capability, we can retrieve document data more efficiently, and with BM25 we can rank the results to achieve better retrieval quality.
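For reference, a minimal sketch of the textbook BM25 formula with the conventional `k1`/`b` parameters (purely illustrative; not Lance's implementation):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["lance", "columnar", "format"], ["vector", "index"], ["lance", "index"]]
print(bm25_scores(["lance", "index"], docs))
```

The document containing both query terms scores highest, which is exactly the ranking behavior we want from the FTS node.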

Design

The index consists of three parts: TokenSet, InvertedList, and DocSet. We will store them in three separate files.

We split the index structure across three files because it allows us to minimize IO:
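A minimal in-memory sketch of what the three components hold (field names are illustrative, not Lance's actual on-disk layout):

```python
from dataclasses import dataclass, field

@dataclass
class TokenSet:
    # token -> token id; essentially the dictionary of the corpus
    tokens: dict = field(default_factory=dict)

    def get_or_add(self, token: str) -> int:
        return self.tokens.setdefault(token, len(self.tokens))

@dataclass
class InvertedList:
    # token id -> posting list of {row id: term frequency}
    postings: dict = field(default_factory=dict)

    def add(self, token_id: int, row_id: int):
        plist = self.postings.setdefault(token_id, {})
        plist[row_id] = plist.get(row_id, 0) + 1

@dataclass
class DocSet:
    # row id -> document length, needed for BM25 length normalization
    num_tokens: dict = field(default_factory=dict)

def index_document(row_id, text, token_set, inverted, doc_set):
    """Tokenize one document and update all three components."""
    tokens = text.lower().split()
    for t in tokens:
        inverted.add(token_set.get_or_add(t), row_id)
    doc_set.num_tokens[row_id] = len(tokens)
```

A query only needs to read the TokenSet to map terms to ids, then the postings for those ids, so keeping the parts in separate files avoids loading the whole index.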

Filtering

The execution plan is shown in the attached diagram. The filtering generates a RowIdMask that indicates which rows should be included; the FTS node checks each row against the RowIdMask and takes the k documents with the highest scores among the included rows.
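The mask-then-top-k step can be sketched as follows, with a plain set standing in for the RowIdMask (hypothetical names, not the Rust implementation):

```python
import heapq

def fts_with_filter(scored_rows, allowed_row_ids, k):
    """scored_rows: iterable of (row_id, score) pairs from the FTS scorer.
    allowed_row_ids plays the role of the RowIdMask produced by the filter."""
    kept = ((rid, s) for rid, s in scored_rows if rid in allowed_row_ids)
    # take the k highest-scoring rows among those the mask allows
    return heapq.nlargest(k, kept, key=lambda pair: pair[1])

scored = [(0, 2.5), (1, 9.0), (2, 4.0), (3, 7.5)]
mask = {0, 2, 3}  # row 1 is filtered out despite its high score
print(fts_with_filter(scored, mask, 2))  # → [(3, 7.5), (2, 4.0)]
```

Applying the mask before top-k selection is what makes filtered FTS return the best k rows that satisfy the predicate, rather than post-filtering an unfiltered top-k.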

For now, we support either FTS or vector search, but not both at once; in the future, we may add a rerank node to score the rows output by FTS and vector search to achieve higher retrieval quality.
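One common way such a rerank node could fuse the two result lists is reciprocal rank fusion (RRF); this is only an illustration of the idea, not a committed design for Lance:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of row ids into one ranking.
    Each list contributes 1 / (k + rank + 1) per row it contains."""
    scores = {}
    for ranking in rankings:
        for rank, row_id in enumerate(ranking):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (k + rank + 1)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

fts_hits = [10, 11, 12]     # row ids ranked by BM25
vector_hits = [12, 10, 13]  # row ids ranked by vector distance
print(reciprocal_rank_fusion([fts_hits, vector_hits]))
```

Rows that appear near the top of both lists (here 10 and 12) float above rows found by only one retriever.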

Updates

As described above, the index consists of three parts. We'd copy the TokenSet from the previously built index, then update it if needed. The TokenSet is essentially a dictionary of words, so copying it is not costly.

The InvertedList would then be rebuilt against the new TokenSet during indexing, and so would the DocSet.
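The update path described above can be sketched end to end with plain dicts (illustrative only; the names do not match Lance internals):

```python
def update_index(old_token_set, new_docs):
    """Copy the old token dictionary, extend it with unseen tokens,
    then rebuild the postings and document lengths for the new data."""
    token_set = dict(old_token_set)  # cheap copy of the word dictionary
    inverted = {}                    # token id -> {row id: term frequency}
    doc_lengths = {}                 # row id -> token count
    for row_id, text in new_docs:
        tokens = text.lower().split()
        for t in tokens:
            tid = token_set.setdefault(t, len(token_set))
            plist = inverted.setdefault(tid, {})
            plist[row_id] = plist.get(row_id, 0) + 1
        doc_lengths[row_id] = len(tokens)
    return token_set, inverted, doc_lengths

old = {"lance": 0}
token_set, inverted, doc_lengths = update_index(old, [(5, "lance index")])
```

Only the small dictionary is carried over; the postings and lengths are derived entirely from the new data, which matches the "copy TokenSet, rebuild the rest" flow above.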

TODO items

Features

Low Priority

APIs

Docs

Additional items

BubbleCal commented 2 months ago

To get it working as soon as possible, I haven't integrated it into the filter expression; instead, I just added a new interface to execute the full text search. We may remove this interface once we get the parser ready. Here is a Python example:

import random
import lance
import pyarrow as pa
import string
import tempfile

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]

id_array = pa.array(ids, type=pa.int64())
# the inverted index supports only large string arrays
doc_array = pa.array(docs, type=pa.large_string())

table = pa.table({"id": id_array, "doc": doc_array})
temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)