Open caes27 opened 4 months ago
Hey there, @caes27! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for a human maintainer.
The issue you're encountering is likely due to the refine_factor parameter in the LanceDBVectorStore class. This parameter fetches more results than requested and then re-ranks them in memory. If this factor is set, it will fetch more nodes than the similarity_top_k value specified in your query.
Here's the relevant code snippet:
table = self.connection.open_table(self.table_name)
lance_query = (
    table.search(
        query=query.query_embedding,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k)
    .where(where)
    .nprobes(self.nprobes)
)
if self.refine_factor is not None:
    lance_query.refine_factor(self.refine_factor)
In this code, if self.refine_factor is set, it will refine the results by reading extra elements and re-ranking them in memory. This could result in initially fetching 1080 nodes before any further ranking is applied [1][2].
To address this, you can try adjusting the refine_factor parameter when initializing the LanceDBVectorStore:
vector_store = LanceDBVectorStore.from_table(table, refine_factor=None)
Additionally, ensure that the nprobes parameter is set appropriately. The default value is 20, which balances accuracy and speed. You can adjust it based on your performance needs:
vector_store = LanceDBVectorStore.from_table(table, nprobes=20)
By fine-tuning these parameters, you should be able to control the number of nodes retrieved and improve the accuracy of your queries.
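As a rough illustration of the over-fetch-and-re-rank idea described above, here is a toy sketch in plain Python. It is not LanceDB's implementation: approx_distance, exact_distance and search are hypothetical names, and the "approximate" distance is a deliberately crude stand-in for a lossy index.

```python
import math


def exact_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def approx_distance(a, b):
    """Crude stand-in for a lossy index distance: compares the first coordinate only."""
    return abs(a[0] - b[0])


def search(vectors, query, top_k, refine_factor=None):
    """Return the indices of the top_k nearest vectors (toy model, hypothetical).

    With refine_factor set, top_k * refine_factor candidates are fetched by
    approximate distance and then re-ranked in memory by exact distance.
    """
    fetch = top_k * (refine_factor or 1)
    candidates = sorted(
        range(len(vectors)),
        key=lambda i: approx_distance(vectors[i], query),
    )[:fetch]
    if refine_factor is not None:
        candidates.sort(key=lambda i: exact_distance(vectors[i], query))
    # In this sketch the final result is always cut back to top_k.
    return candidates[:top_k]


vectors = [(0.0, 9.0), (0.1, 0.0), (0.2, 8.0), (0.3, 0.1), (5.0, 5.0)]
query = (0.0, 0.0)
print(search(vectors, query, top_k=2))
print(search(vectors, query, top_k=2, refine_factor=2))
```

In this sketch only the candidate pool grows with refine_factor; the number of results handed back stays at top_k.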
To continue talking to Dosu, mention @dosu.
@dosu
vector_store = LanceDBVectorStore.from_table()
only takes in a table as a parameter, nothing else.
Speaking about this, can this be a feature added in the future? I've been having to go into the LanceDBVectorStore and manually changing the query_type myself when conducting some evaluation on my RAG system.
You're right, the current LanceDBVectorStore.from_table() method only takes a table as a parameter and does not accept additional parameters like refine_factor or nprobes. This is a limitation in the current implementation.
However, it is feasible to modify the from_table method to accept these parameters. Here is an example of how the method could be updated to include refine_factor and nprobes:
@classmethod
def from_table(cls, table: Any, refine_factor: Optional[int] = None, nprobes: int = 20) -> "LanceDBVectorStore":
    """Create instance from table."""
    try:
        if not isinstance(
            table, (lancedb.db.LanceTable, lancedb.remote.table.RemoteTable)
        ):
            raise Exception("argument is not lancedb table instance")
        return cls(table=table, refine_factor=refine_factor, nprobes=nprobes)
    except Exception as e:
        print("ldb version", lancedb.__version__)
        raise
This modification would allow users to specify refine_factor and nprobes when creating an instance from a table. The LanceDBVectorStore class already supports these parameters in its __init__ method [1][2].
For now, you might need to manually adjust these parameters in the LanceDBVectorStore class. I'll loop in a maintainer to see if this feature can be added in a future update.
@logan-markewich, could you step in and provide your guidance on this feature request?
@dosu
Update: I ran some more tests, and it is not an issue with creating a vector store from a lancedb table, as I thought it might be. Here is some code:
question = "test"
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_2", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir="./RL_test_2_8.1")
retriever = index.as_retriever(similarity_top_k = 2500)
nodes = retriever.retrieve(question)
db = lancedb.connect("lancedb_RL_test_2")
table = db.open_table("docs_8.1")
vector_store2 = LanceDBVectorStore.from_table(table)
index2 = VectorStoreIndex.from_vector_store(vector_store)
retriever2 = index2.as_retriever(similarity_top_k = 2500)
nodes2 = retriever2.retrieve(question)
Keep in mind, "all_leaf_nodes" contains 3558 nodes, but both times I retrieve nodes using the VectorStoreIndex as a retriever, it is being limited to 1498. Any idea of what might be happening? I can see the 3500+ nodes inside of my lancedb table directory.
The retrieval of nodes from your LanceDBVectorStore is limited to 1498 nodes because of the query.similarity_top_k parameter in the query method. This parameter specifies the maximum number of top similar nodes to retrieve.
Here is the relevant code snippet:
lance_query = (
    self._table.search(
        query=_query,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k * self.overfetch_factor)
    .where(where)
)
The .limit(query.similarity_top_k * self.overfetch_factor) part of the code limits the number of nodes retrieved to the value specified by query.similarity_top_k multiplied by self.overfetch_factor. If this value is set to 1498, then only 1498 nodes will be retrieved, even if there are more nodes stored in the database. Adjusting the query.similarity_top_k or self.overfetch_factor values can help retrieve more nodes if needed [1][2].
Additionally, the default value of self.overfetch_factor in the LanceDBVectorStore implementation is 1. It can be adjusted by passing a different value to the overfetch_factor parameter when initializing the LanceDBVectorStore object:
vector_store = LanceDBVectorStore(overfetch_factor=2)
This should help you retrieve more nodes if needed [2].
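To make the arithmetic concrete, here is a small sketch of the limit computation from the snippet above. effective_limit and max_rows_returned are hypothetical helper names, not part of LanceDBVectorStore, and the sketch assumes the search returns as many rows as the limit and the table size allow.

```python
def effective_limit(similarity_top_k: int, overfetch_factor: int = 1) -> int:
    """Row limit passed to the search: similarity_top_k * overfetch_factor."""
    return similarity_top_k * overfetch_factor


def max_rows_returned(similarity_top_k: int, overfetch_factor: int, rows_in_table: int) -> int:
    """A limit only caps results; it cannot yield more rows than the table holds."""
    return min(effective_limit(similarity_top_k, overfetch_factor), rows_in_table)


# Numbers taken from the report above: top_k of 2500, 3558 leaf nodes.
print(effective_limit(2500))
print(max_rows_returned(2500, 1, 3558))
print(max_rows_returned(2500, 2, 3558))
```

Under these assumptions, a similarity_top_k of 2500 with the default overfetch_factor of 1 would by itself allow 2500 rows from a 3558-row table, so it may also be worth checking how many rows the table actually holds.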
Hi @caes27 , thanks for reporting the issue.
I tested from the integration end and came to the following conclusions (I used the hierarchical parser and ingested 768 nodes into the DB):
len(index.vector_store._table.search().where(None).limit(700).to_pandas()) gives the correct result and returns 700.
similarity_top_k seems to be parsed correctly by the lancedb query function: for response = index.as_retriever(similarity_top_k = 700).retrieve('test'), the debug print in the query function shows nodes : 700.
len(response), however, returns 234, which seems odd. I am not sure, but it seems to be an issue in how the final results are built by the llama index retriever API / query engine API. I can see the VectorIndexRetriever._build_node_list_from_query_result() function being called, but @logan-markewich, could you have a look, as you would have a better idea?
From the lancedb integration API end it seems to be fine. Perhaps there is some minor docstore or storage context issue, and I can make the fix if needed, but I am not sure what the fix is.
adding the query function debug code snippet :
def query(
    self,
    query: VectorStoreQuery,
    **kwargs: Any,
) -> VectorStoreQueryResult:
    """Query index for top k most similar nodes."""
    if query.filters is not None:
        if "where" in kwargs:
            raise ValueError(
                "Cannot specify filter via both query and kwargs. "
                "Use kwargs only for lancedb specific items that are "
                "not supported via the generic query interface."
            )
        where = _to_lance_filter(query.filters, self._metadata_keys)
    else:
        where = kwargs.pop("where", None)
    query_type = kwargs.pop("query_type", self.query_type)
    # logging needs a %s placeholder for the extra argument
    _logger.info("query_type : %s", query_type)
    if query_type == "vector":
        _query = query.query_embedding
    else:
        if not isinstance(self._table, lancedb.db.LanceTable):
            raise ValueError(
                "creating FTS index is not supported for LanceDB Cloud yet. "
                "Please use a local table for FTS/Hybrid search."
            )
        if self._fts_index is None:
            self._fts_index = self._table.create_fts_index(
                self.text_key, replace=True
            )
        if query_type == "hybrid":
            _query = (query.query_embedding, query.query_str)
        elif query_type == "fts":
            _query = query.query_str
        else:
            raise ValueError(f"Invalid query type: {query_type}")
    lance_query = (
        self._table.search(
            query=_query,
            vector_column_name=self.vector_column_name,
        )
        .limit(query.similarity_top_k * self.overfetch_factor)
        .where(where)
    )
    if query_type != "fts":
        lance_query.nprobes(self.nprobes)
        if query_type == "hybrid" and self._reranker is not None:
            _logger.info(f"using {self._reranker} for reranking results.")
            lance_query.rerank(reranker=self._reranker)
    if self.refine_factor is not None:
        lance_query.refine_factor(self.refine_factor)
    results = lance_query.to_pandas()
    if len(results) == 0:
        raise Warning("query results are empty..")
    nodes = []
    for _, item in results.iterrows():
        try:
            node = metadata_dict_to_node(item.metadata)
            node.embedding = list(item[self.vector_column_name])
        except Exception:
            # deprecated legacy logic for backward compatibility
            _logger.debug(
                "Failed to parse Node metadata, fallback to legacy logic."
            )
            if item.metadata:
                metadata, node_info, _relation = legacy_metadata_dict_to_node(
                    item.metadata, text_key=self.text_key
                )
            else:
                metadata, node_info = {}, {}
            node = TextNode(
                text=item[self.text_key] or "",
                id_=item.id,
                metadata=metadata,
                start_char_idx=node_info.get("start", None),
                end_char_idx=node_info.get("end", None),
                relationships={
                    NodeRelationship.SOURCE: RelatedNodeInfo(
                        node_id=item[self.doc_id_key]
                    ),
                },
            )
        nodes.append(node)
    # _logger.info("nodes :", len(nodes))
    print("nodes :", len(nodes))  # this returns the correct no. of nodes as per similarity_top_k
    return VectorStoreQueryResult(
        nodes=nodes,
        similarities=_to_llama_similarities(results),
        ids=results["id"].tolist(),
    )
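One way to narrow down where nodes disappear between the vector store result and the retriever output might be to compare the IDs seen at both stages. The helper below is a hypothetical diagnostic in plain Python (compare_stages is not a llama index or lancedb function); it only reports duplicate and missing IDs and does not assume any particular cause.

```python
from collections import Counter


def compare_stages(store_ids, retriever_ids):
    """Summarise how two lists of node IDs differ (hypothetical diagnostic)."""
    counts = Counter(store_ids)
    retrieved = set(retriever_ids)
    return {
        "store_total": len(store_ids),
        "store_unique": len(counts),
        "retriever_total": len(retriever_ids),
        # IDs that the vector store returned more than once
        "duplicate_ids": {i: c for i, c in counts.items() if c > 1},
        # IDs returned by the vector store but absent from the final result
        "missing_ids": [i for i in counts if i not in retrieved],
    }


# Made-up IDs, purely for illustration.
report = compare_stages(["a", "b", "b", "c", "d"], ["a", "b", "c"])
print(report)
```

The first list could come from the ids returned by the query function above, and the second from the node IDs of the retriever's response.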
Hello @raghavdixit99,
Thank you for helping me, I really appreciate it.
There are a bunch of things that are weird.
I rechunked a smaller set of documents and ingested 3500 nodes into a separate lancedb table. I set similarity_top_k to 1500 and by adding your debugging statement of:
print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k
It correctly showed 1500 nodes being returned, but in the final response:
response = index.as_retriever(similarity_top_k = 700).retrieve('test')
print(len(response))
This outputted 1488 nodes, so some nodes were lost in this process. It was kinda fascinating how yours went from 700 to 234. But there is also another issue.
Since there are 3500 documents, I wanted to test it with a larger limit/similarity_top_k.
I set it to 2500 and ran both of the following, with the same outcome every time:
table_nodes = table.search().limit(2500).to_list()
print(len(table_nodes))
response = index.as_retriever(similarity_top_k = 2500).retrieve('test')
print(len(response))
The top piece of code returned 1510 nodes. For the bottom piece of code, the debugging statement added into the query function showed 1510 nodes, and then it went down to 1498.
The limit/similarity_top_k was set to 2500, so what is going on here? I think this is a bigger issue than the nodes being lost in the final stages of the retrieval process.
Tagging for visibility: @logan-markewich
@caes27, a lancedb search, table.search().limit(x), will return the correct result, as that's calling our OSS API, which is a simple vector search and has been tested without any issues.
Additionally, I tested it locally via len(index.vector_store._table.search().where(None).limit(None).to_pandas()) and got the entire table (768 nodes), which is the correct result. You can refer to our API reference for more details: https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.limit
Perhaps your table has not ingested all the data, or your uri needs a refresh (rm -rf /your_lancedb_path).
As for the final retrieval results coming less than expected I have already covered that in my comment and tagged Logan, we should wait for his response as it seems like a parsing problem from the base retriever class.
Thanks
Hey @raghavdixit99,
I believe you when you say the table.search().limit(x) method works lol
I have refreshed the uri multiple times and same issue. Maybe it's a matter of how nodes are being ingested into the lancedb table when you do this:
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_3", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
I can't see anywhere else where it can go wrong.
If you have time, maybe you can try it on your end by populating the table with 2000+ nodes and see if you get the same issue?
Thank you!
Did more digging. As I was populating the table little by little, instead of sending it 25000+ nodes at once, I realized something.
Suppose my table has 500 nodes in it currently and I want to add 300 more nodes to the table. I run the following code:
vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid')
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
After this is done, there should be 800 nodes in the lancedb table, but after I execute the following code:
db3 = lancedb.connect("lancedb_TEST")
table3 = db3.open_table("docs")
vector_store3 = LanceDBVectorStore.from_table(table3)
index3 = VectorStoreIndex.from_vector_store(vector_store3)
index3.insert_nodes(all_leaf_nodes)
retriever3 = index3.as_retriever(similarity_top_k = 1500)
nodes3 = retriever3.retrieve(question)
nodes3 is of length 300, which were the nodes I just added. It ignores the 500 nodes that were in the lancedb table previously.
Is this not the correct way to add nodes to an existing lancedb table? I appreciate any help, thank you!
Hi @caes27 Thanks for the update.
Since you are trying to ingest data iteratively, you should try changing the mode to "append". By default the table overwrites the data, which could be the reason for this behavior.
vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid', mode="append")
Hello @raghavdixit99,
I think I might have found the issue that was causing problems.
First, I noticed some faulty logic in the LanceDBVectorStore's "add" method, and fixed it myself. At the same time, I thought about upgrading the package and this also fixed it lol.
I also tried your solution yesterday, and it works if the table already exists and has some sort of data in it. However, it throws an error when the table is empty. So that is a work around, but this fix in the "add" method seemed to solve ingesting data into a fresh table:
Previous:
if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode=self.mode)
    else:
        self._table.add(data)
After:
if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode="append")
    else:
        self._table.add(data)
From what I saw, the data was ingested in batches and when the second batch came around, because it was in "overwrite" mode, the second batch's data would completely wipe the first batch's data and so on.
The other problem with LlamaIndex losing some nodes in the retrieval process still persists, so I'm still waiting.
Thanks again Raghav for your help throughout this whole thread.
Hi @caes27 that is not faulty logic, you are hardcoding the mode argument and we have added that as per user’s requirement/ input. Please follow the usage as per my last comment, rest we are waiting on Logans response.
that is not faulty logic, you are hardcoding the mode argument and we have added that as per user’s requirement/ input
@raghavdixit99 I think the problem @caes27 is pointing out is that "append" is not a valid mode for create_table (I see the below error when the table does not exist yet).
.venv/lib/python3.11/site-packages/lancedb/db.py", line 414, in create_table
raise ValueError("mode must be either 'create' or 'overwrite'")
The llama index code is using the mode parameter for both create_table and table.add, but the values LanceDB expects for each are different. For create_table, valid modes are "create" or "overwrite", whereas for table.add, the mode must be "overwrite" or "append". This works OK for "overwrite" since the modes overlap, but it doesn't work for "append".
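A minimal sketch of how one user-facing mode could be translated into a valid value for each call, based only on the mode sets listed above. split_mode is a hypothetical helper, not part of llama index or LanceDB, and mapping "append" to "create" for a table that does not exist yet is an assumption about the desired behaviour.

```python
from typing import Tuple

# Valid values as described above.
CREATE_MODES = {"create", "overwrite"}   # accepted by create_table
ADD_MODES = {"append", "overwrite"}      # accepted by table.add


def split_mode(user_mode: str) -> Tuple[str, str]:
    """Return (mode for create_table, mode for table.add) for one user-facing mode."""
    if user_mode not in CREATE_MODES | ADD_MODES:
        raise ValueError(f"Unsupported mode: {user_mode!r}")
    # Assumption: a mode that create_table rejects falls back to "create",
    # and a mode that table.add rejects falls back to "append".
    create_mode = user_mode if user_mode in CREATE_MODES else "create"
    add_mode = user_mode if user_mode in ADD_MODES else "append"
    return create_mode, add_mode


print(split_mode("append"))
print(split_mode("overwrite"))
```

With "overwrite" the sketch passes the same value to both calls, so it does not change the overwrite behaviour; it only avoids handing "append" to create_table.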
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
Hi @caes27 In your example, I saw all_nodes passed to docstore and all_leaf_nodes passed to VectorStoreIndex. Is this intended, or could this be the reason for your issue?
Hi @logan-markewich @raghavdixit99 @spearki
I can confirm this is a bug; I ran into the same issue when first creating a VectorDB from scratch.
What happens is that self.mode defaults to overwrite during initialization:
https://github.com/run-llama/llama_index/blob/f7676375ea8d80ce10b92d4e6de73b1bcb77cbc9/llama-index-integrations/vector_stores/llama-index-vector-stores-lancedb/llama_index/vector_stores/lancedb/base.py#L354C1-L355C1
But since the data is ingested in batches and the latest batch keeps overwriting the previous ones, in the end the VectorDB is initialized with only input_record_size % insert_batch_size records.
Could you kindly provide a patch update to fix this issue? @caes27 's solution solved it for me
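The batch-overwrite behaviour described above can be reproduced with a toy in-memory table. FakeTable and ingest are hypothetical stand-ins that only mimic add(data, mode=...), and the batch size of 1000 is an assumed value chosen for illustration, not the integration's actual batch size.

```python
class FakeTable:
    """In-memory stand-in for a table; only mimics add(data, mode=...)."""

    def __init__(self):
        self.rows = []

    def add(self, data, mode="append"):
        if mode == "overwrite":
            self.rows = list(data)      # replaces everything stored so far
        elif mode == "append":
            self.rows.extend(data)      # keeps earlier batches
        else:
            raise ValueError(f"Unsupported mode: {mode!r}")


def ingest(records, batch_size, mode):
    """Add records in batches and return how many rows the table ends up with."""
    table = FakeTable()
    for start in range(0, len(records), batch_size):
        table.add(records[start:start + batch_size], mode=mode)
    return len(table.rows)


records = list(range(3558))  # same count as the leaf nodes mentioned earlier
print(ingest(records, 1000, "overwrite"))  # 558, i.e. 3558 % 1000
print(ingest(records, 1000, "append"))     # 3558
```

In this simplified model only the last batch survives in "overwrite" mode, which matches the input_record_size % insert_batch_size observation whenever the record count is not an exact multiple of the batch size.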
It would be great if @caes27 or @manfredwang093 can open a PR for this, and maybe include a unit test :) Tbh there have been several updates in lancedb since this issue was opened, I'm not even sure if this is an issue still
Question Validation
Question
I don't know what I am doing wrong. I chunked a few hundred documents using the HierarchicalNodeParser and stored them in a lanceDB database using this guide. It has about 24000 leaf nodes in it.
If I want to query the data, I use the code down below:
What this seems to be doing is initially grabbing the same exact 1080 nodes from the database, then ranking them based on vector similarity to query. I tried tuning the overfetch_factor and nprobes parameters of the LanceDBVectorStore, but this seems to do nothing. I am very confused on what I might be doing wrong? Any help?