eddyxu opened 1 year ago
Maybe worth a look when we implement this: https://github.com/huggingface/tokenizers
Got some user feedback on potential API ideas we might want: https://discord.com/channels/1030247538198061086/1197630499926057021/1238721206006317066
With full-text search capability, we can retrieve document data more efficiently, and with BM25 we can rank the results to achieve better retrieval quality.
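For reference, this is the standard BM25 scoring function we'd be computing per query token (a minimal sketch of the textbook formula; the exact `k1`/`b` defaults and IDF variant in the eventual implementation may differ):

```python
import math

def bm25_score(tf: int, doc_len: int, avg_doc_len: float,
               num_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    # Score one (token, document) pair with the textbook BM25 formula.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

A document's score for a query is the sum of this value over the query's tokens.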
The index consists of 3 parts: `TokenSet`, `InvertedList`, and `DocSet`. We will store them in 3 individual files:

- `TokenSet` is basically a map from token (word) to its token_id and frequency.
- `InvertedList` records the row_id and frequency at which each token occurs, for fast retrieval.
- `DocSet` records the number of tokens in each row, which is used for BM25 scoring.

We divide the index structure into three files because it allows us to minimize IO:

- For new data, we can reuse the `TokenSet` and generate the `InvertedList` and `DocSet` for the new data.
- For searching, we need to read only the `TokenSet` and `InvertedList`.
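To make the layout concrete, here is a rough in-memory sketch of the three parts (hypothetical Python shapes; the actual on-disk encoding differs):

```python
from dataclasses import dataclass, field

@dataclass
class TokenSet:
    # token (word) -> (token_id, total frequency across the corpus)
    tokens: dict[str, tuple[int, int]] = field(default_factory=dict)

@dataclass
class InvertedList:
    # token_id -> postings of (row_id, frequency of the token in that row)
    postings: dict[int, list[tuple[int, int]]] = field(default_factory=dict)

@dataclass
class DocSet:
    # row_id -> number of tokens in that row (document length for BM25)
    token_counts: dict[int, int] = field(default_factory=dict)
```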
When the search is combined with a filter, the execution plan is: the filter generates a `RowIdMask` that represents the rows that should be included; the FTS node then checks each row against the `RowIdMask` and takes the k documents with the highest scores among the included rows.
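A minimal sketch of that plan, building on the `bm25_score` and index-structure sketches above and treating the `RowIdMask` as a plain set of allowed row ids (the real mask is a more compact structure):

```python
import heapq

def fts_with_prefilter(query_tokens, token_set, inverted_list, doc_set,
                       row_id_mask, k=10):
    # Hypothetical helper: score only rows allowed by the mask, keep the k best.
    num_docs = len(doc_set.token_counts)
    avg_len = sum(doc_set.token_counts.values()) / max(num_docs, 1)
    scores = {}
    for token in query_tokens:
        entry = token_set.tokens.get(token)
        if entry is None:
            continue  # token never seen at indexing time
        token_id, _ = entry
        postings = inverted_list.postings.get(token_id, [])
        for row_id, tf in postings:
            if row_id not in row_id_mask:  # predicate each row by the mask
                continue
            scores[row_id] = scores.get(row_id, 0.0) + bm25_score(
                tf, doc_set.token_counts[row_id], avg_len,
                num_docs, doc_freq=len(postings))
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```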
For now, we support doing either FTS or vector search, but not both together. In the future, we may add a rerank node to score the rows output by FTS & vector search to achieve higher retrieval quality.
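One possible shape for such a rerank node, using reciprocal rank fusion as a stand-in scorer (the actual fusion strategy is an open question):

```python
def rerank_rrf(fts_row_ids, vector_row_ids, k=10, c=60.0):
    # Reciprocal rank fusion: each ranked list contributes 1 / (c + rank) per
    # row, so rows ranked highly by both FTS and vector search rise to the top.
    scores = {}
    for ranking in (fts_row_ids, vector_row_ids):
        for rank, row_id in enumerate(ranking):
            scores[row_id] = scores.get(row_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```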
As described above, the index consists of three parts. We'd copy the `TokenSet` from the previously built index, then update it if needed. The `TokenSet` is essentially a dictionary of words, so it won't be costly to copy. The `InvertedList` would then be constructed against the new `TokenSet` during indexing, and so would the `DocSet`.
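Sketched with the structures above (`text.split()` stands in for whatever tokenizer the index actually uses):

```python
from collections import Counter

def update_index(prev_token_set, new_rows):
    # Copy the TokenSet from the previously built index, extend it with the
    # new rows, and build a fresh InvertedList and DocSet for those rows only.
    token_set = TokenSet(dict(prev_token_set.tokens))
    inverted = InvertedList()
    doc_set = DocSet()
    for row_id, text in new_rows:
        tokens = text.split()  # stand-in for the real tokenizer
        doc_set.token_counts[row_id] = len(tokens)
        for token, tf in Counter(tokens).items():
            token_id, corpus_freq = token_set.tokens.get(
                token, (len(token_set.tokens), 0))
            token_set.tokens[token] = (token_id, corpus_freq + tf)
            inverted.postings.setdefault(token_id, []).append((row_id, tf))
    return token_set, inverted, doc_set
```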
To get it working as soon as possible, I haven't integrated it into the filter expression; instead, I just added a new interface to execute the full-text search. We may remove this interface once we get the parser ready. Here is a Python example:
```python
import random
import string
import tempfile

import lance
import pyarrow as pa

# generate dataset
n = 1000
ids = range(n)
docs = ["".join(random.choices(string.ascii_letters, k=5)) for _ in range(n)]
id_array = pa.array(ids, type=pa.int64())
# the inverted index supports large string array only
doc_array = pa.array(docs, type=pa.large_string())
table = pa.table({"id": id_array, "doc": doc_array})

temp_dir = tempfile.mkdtemp()
dataset = lance.write_dataset(table, temp_dir)
dataset.create_scalar_index("doc", "INVERTED")

results = dataset.scanner(
    ["id", "doc"],
    limit=10,
    full_text_query=docs[0],
).to_table()
print(results)
```
Reduce index file size and improve the indexing performance.
Given that we have https://github.com/lancedb/tantivy-object-store ready now, we can start to integrate tantivy FTS into the Rust core and offer FTS through the js/python/rust bindings.
Because we need to work with a variety of storage systems, we will likely need to vendor and adapt tantivy to meet our needs. Many of the components, such as the tokenizer and scoring, can be reused as-is.