[Question]: Index created from a LanceDBVectorStore is not returning all nodes?

run-llama / llama_index

[Question]: Index created from a LanceDBVectorStore is not returning all nodes? #14435

Open caes27 opened 4 months ago

caes27 commented 4 months ago

Question Validation

[x] I have searched both the documentation and discord for an answer.

Question

I don't know what I am doing wrong. I chunked a few hundred documents using the HierarchicalNodeParser and stored them in a lanceDB database using this guide. It has about 24000 leaf nodes in it.

If I want to query the data, I use the code down below:

import lancedb
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.lancedb import LanceDBVectorStore

embed_model = OpenAIEmbedding(model="text-embedding-3-large")

db = lancedb.connect("./lancedb_RL_test_2")
Settings.embed_model = embed_model
table = db.open_table("docs_OpenAI")
vector_store = LanceDBVectorStore.from_table(table)
index = VectorStoreIndex.from_vector_store(vector_store)
storage_context = StorageContext.from_defaults(persist_dir="./RL_test_2")
base_retriever = index.as_retriever(similarity_top_k=10000)

question = "test"
nodes = base_retriever.retrieve(question)

This seems to grab the exact same 1080 nodes from the database every time and then rank them by vector similarity to the query. I tried tuning the overfetch_factor and nprobes parameters of the LanceDBVectorStore, but that seems to do nothing. I am very confused about what I might be doing wrong. Any help?

dosubot[bot] commented 4 months ago

Hey there, @caes27! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for a human maintainer.

The issue you're encountering is likely due to the refine_factor parameter in the LanceDBVectorStore class. This parameter fetches more results than requested and then re-ranks them in memory. If this factor is set, it will fetch more nodes than the similarity_top_k value specified in your query.

Here's the relevant code snippet:

table = self.connection.open_table(self.table_name)
lance_query = (
    table.search(
        query=query.query_embedding,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k)
    .where(where)
    .nprobes(self.nprobes)
)

if self.refine_factor is not None:
    lance_query.refine_factor(self.refine_factor)

In this code, if self.refine_factor is set, it will refine the results by reading extra elements and re-ranking them in memory. This could result in initially fetching 1080 nodes before any further ranking is applied ^[1]^[2].

To address this, you can try adjusting the refine_factor parameter when initializing the LanceDBVectorStore:

vector_store = LanceDBVectorStore.from_table(table, refine_factor=None)

Additionally, ensure that the nprobes parameter is set appropriately. The default value is 20, which balances accuracy and speed. You can adjust it based on your performance needs:

vector_store = LanceDBVectorStore.from_table(table, nprobes=20)

By fine-tuning these parameters, you should be able to control the number of nodes retrieved and improve the accuracy of your queries.

To continue talking to Dosu, mention @dosu.

caes27 commented 4 months ago

@dosu

vector_store = LanceDBVectorStore.from_table() only takes in a table as a parameter, nothing else.

Speaking of which, could this be added as a feature in the future? I've had to go into the LanceDBVectorStore source and change the query_type manually when running evaluations on my RAG system.

dosubot[bot] commented 4 months ago

You're right, the current LanceDBVectorStore.from_table() method only takes a table as a parameter and does not accept additional parameters like refine_factor or nprobes. This is a limitation in the current implementation.

However, it is feasible to modify the from_table method to accept these parameters. Here is an example of how the method could be updated to include refine_factor and nprobes:

@classmethod
def from_table(cls, table: Any, refine_factor: Optional[int] = None, nprobes: int = 20) -> "LanceDBVectorStore":
    """Create instance from table."""
    try:
        if not isinstance(
            table, (lancedb.db.LanceTable, lancedb.remote.table.RemoteTable)
        ):
            raise Exception("argument is not lancedb table instance")
        return cls(table=table, refine_factor=refine_factor, nprobes=nprobes)
    except Exception as e:
        print("ldb version", lancedb.__version__)
        raise

This modification would allow users to specify refine_factor and nprobes when creating an instance from a table. The LanceDBVectorStore class already supports these parameters in its __init__ method ^[1]^[2].

For now, you might need to manually adjust these parameters in the LanceDBVectorStore class. I'll loop in a maintainer to see if this feature can be added in a future update.

@logan-markewich, could you step in and provide your guidance on this feature request?

caes27 commented 4 months ago

@dosu

Update: I ran some more tests, and the issue is not caused by creating a vector store from a LanceDB table, as I had suspected. Here is some code:

question = "test"

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_2", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir="./RL_test_2_8.1")

retriever = index.as_retriever(similarity_top_k = 2500)
nodes = retriever.retrieve(question)

db = lancedb.connect("lancedb_RL_test_2")
table = db.open_table("docs_8.1")
vector_store2 = LanceDBVectorStore.from_table(table)
index2 = VectorStoreIndex.from_vector_store(vector_store2)

retriever2 = index2.as_retriever(similarity_top_k = 2500)
nodes2 = retriever2.retrieve(question)

Keep in mind that all_leaf_nodes contains 3558 nodes, but both times I retrieve nodes using the VectorStoreIndex as a retriever, the result is limited to 1498. Any idea what might be happening? I can see the 3500+ nodes inside my lancedb table directory.

dosubot[bot] commented 4 months ago

The retrieval of nodes from your LanceDBVectorStore is limited to 1498 nodes because of the query.similarity_top_k parameter in the query method. This parameter specifies the maximum number of top similar nodes to retrieve.

Here is the relevant code snippet:

lance_query = (
    self._table.search(
        query=_query,
        vector_column_name=self.vector_column_name,
    )
    .limit(query.similarity_top_k * self.overfetch_factor)
    .where(where)
)

The .limit(query.similarity_top_k * self.overfetch_factor) part of the code limits the number of nodes retrieved to the value specified by query.similarity_top_k multiplied by self.overfetch_factor. If this value is set to 1498, then only 1498 nodes will be retrieved, even if there are more nodes stored in the database. Adjusting the query.similarity_top_k or self.overfetch_factor values can help retrieve more nodes if needed ^[1]^[2].

Additionally, the default value of self.overfetch_factor in the LanceDBVectorStore implementation is 1. It can be adjusted by passing a different value to the overfetch_factor parameter when initializing the LanceDBVectorStore object:

vector_store = LanceDBVectorStore(overfetch_factor=2)

This should help you retrieve more nodes if needed ^[2].
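
To make the limit arithmetic above concrete, here is a minimal sketch. The helper max_rows_returned is hypothetical (it is not part of LlamaIndex or LanceDB), and it ignores filters and any effect of an approximate index; it only shows that the limit is an upper bound that is further capped by the number of rows actually stored.

```python
def max_rows_returned(
    similarity_top_k: int, overfetch_factor: int, rows_in_table: int
) -> int:
    """Upper bound on rows one query can return (hypothetical helper).

    The query asks for similarity_top_k * overfetch_factor rows, but it can
    never return more rows than the table actually holds.
    """
    limit = similarity_top_k * overfetch_factor
    return min(limit, rows_in_table)


# Asking for 2500 rows from a table that holds 3558 rows.
print(max_rows_returned(2500, 1, 3558))
# Asking for 2500 rows from a table that holds fewer rows than requested.
print(max_rows_returned(2500, 1, 1510))
```

If a query returns fewer rows than the limit, the first thing this suggests checking is how many rows the table really contains.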

raghavdixit99 commented 4 months ago

Hi @caes27 , thanks for reporting the issue.

I tested from the integration side and came to the following conclusions (I used the hierarchical parser and ingested 768 nodes into the DB):

len(index.vector_store._table.search().where(None).limit(700).to_pandas()) gives the correct result and returns 700.
I added a print statement logging len(results) fetched in the query function, and similarity_top_k seems to be passed correctly to the lancedb query function: for response = index.as_retriever(similarity_top_k = 700).retrieve('test'), it prints nodes : 700.
But when I check len(response), it returns 234, which seems odd.

I am not sure, but it seems to be an issue in how the final results are built by the LlamaIndex retriever API / query engine API. I can see the VectorIndexRetriever._build_node_list_from_query_result() function being called, but @logan-markewich, could you take a look, as you would have a better idea?

From the lancedb integration side, it seems to be fine. Perhaps there is some minor docstore or storage context issue, and I can make the fix if needed, but I am not sure what the fix is.

Here is the query function with the debug code added:

    def query(
        self,
        query: VectorStoreQuery,
        **kwargs: Any,
    ) -> VectorStoreQueryResult:
        """Query index for top k most similar nodes."""
        if query.filters is not None:
            if "where" in kwargs:
                raise ValueError(
                    "Cannot specify filter via both query and kwargs. "
                    "Use kwargs only for lancedb specific items that are "
                    "not supported via the generic query interface."
                )
            where = _to_lance_filter(query.filters, self._metadata_keys)
        else:
            where = kwargs.pop("where", None)

        query_type = kwargs.pop("query_type", self.query_type)

        _logger.info("query_type : %s", query_type)

        if query_type == "vector":
            _query = query.query_embedding
        else:
            if not isinstance(self._table, lancedb.db.LanceTable):
                raise ValueError(
                    "creating FTS index is not supported for LanceDB Cloud yet. "
                    "Please use a local table for FTS/Hybrid search."
                )
            if self._fts_index is None:
                self._fts_index = self._table.create_fts_index(
                    self.text_key, replace=True
                )

            if query_type == "hybrid":
                _query = (query.query_embedding, query.query_str)
            elif query_type == "fts":
                _query = query.query_str
            else:
                raise ValueError(f"Invalid query type: {query_type}")

        lance_query = (
            self._table.search(
                query=_query,
                vector_column_name=self.vector_column_name,
            )
            .limit(query.similarity_top_k * self.overfetch_factor)
            .where(where)
        )

        if query_type != "fts":
            lance_query.nprobes(self.nprobes)
            if query_type == "hybrid" and self._reranker is not None:
                _logger.info(f"using {self._reranker} for reranking results.")
                lance_query.rerank(reranker=self._reranker)

        if self.refine_factor is not None:
            lance_query.refine_factor(self.refine_factor)

        results = lance_query.to_pandas()

        if len(results) == 0:
            raise Warning("query results are empty..")

        nodes = []

        for _, item in results.iterrows():
            try:
                node = metadata_dict_to_node(item.metadata)
                node.embedding = list(item[self.vector_column_name])
            except Exception:
                # deprecated legacy logic for backward compatibility
                _logger.debug(
                    "Failed to parse Node metadata, fallback to legacy logic."
                )
                if item.metadata:
                    metadata, node_info, _relation = legacy_metadata_dict_to_node(
                        item.metadata, text_key=self.text_key
                    )
                else:
                    metadata, node_info = {}, {}
                node = TextNode(
                    text=item[self.text_key] or "",
                    id_=item.id,
                    metadata=metadata,
                    start_char_idx=node_info.get("start", None),
                    end_char_idx=node_info.get("end", None),
                    relationships={
                        NodeRelationship.SOURCE: RelatedNodeInfo(
                            node_id=item[self.doc_id_key]
                        ),
                    },
                )

            nodes.append(node)

        # _logger.info("nodes :", len(nodes))
        print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k

        return VectorStoreQueryResult(
            nodes=nodes,
            similarities=_to_llama_similarities(results),
            ids=results["id"].tolist(),
        )
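
One way to investigate the drop between the query function and the final retriever output is to check whether the returned nodes contain duplicates. The sketch below is only a diagnostic under an unconfirmed assumption: nothing in this thread establishes that the retriever deduplicates results. It assumes each node exposes hash and ref_doc_id attributes (as LlamaIndex nodes do), and duplicate_keys is a hypothetical helper name.

```python
from collections import Counter
from typing import Any, Iterable, List, Tuple


def duplicate_keys(nodes: Iterable[Any]) -> List[Tuple[Tuple[Any, Any], int]]:
    """Return (hash, ref_doc_id) keys that occur more than once, with counts.

    Assumption: every node has `hash` and `ref_doc_id` attributes.
    """
    counts = Counter((node.hash, node.ref_doc_id) for node in nodes)
    return [(key, count) for key, count in counts.items() if count > 1]


# Usage sketch (inside the query function shown above, before the return):
#   dupes = duplicate_keys(nodes)
#   print("distinct keys:", len(nodes) - sum(c - 1 for _, c in dupes))
```

If the number of distinct keys matches the length of the final response, duplicated chunks would be a plausible explanation for the missing nodes; if not, the cause lies elsewhere.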

caes27 commented 4 months ago

Hello @raghavdixit99,

Thank you for helping me, I really appreciate it.

There are a bunch of things that are weird.

I rechunked a smaller set of documents and ingested 3500 nodes into a separate lancedb table. I set similarity_top_k to 1500 and by adding your debugging statement of:

 print("nodes :", len(nodes)) # this returns the correct no. of nodes as per similarity_top_k

It correctly showed 1500 nodes being returned, but in the final response:

response = index.as_retriever(similarity_top_k = 1500).retrieve('test')
print(len(response))

This outputted 1488 nodes, so some nodes were lost in this process. It was kinda fascinating how yours went from 700 to 234. But there is also another issue.

Since there are 3500 nodes, I wanted to test with a larger limit/similarity_top_k.

I set it to 2500 and, every time, ran both of the following:

table_nodes = table.search().limit(2500).to_list()
print(len(table_nodes))

response = index.as_retriever(similarity_top_k = 2500).retrieve('test')
print(len(response))

The top piece of code returned 1510 nodes. For the bottom piece of code, the debugging statement added to the query function showed 1510 nodes, and then the final count went down to 1498.

The limit/similarity_top_k was set to 2500, so what is going on here? I think this is a bigger issue than the nodes being lost in the final stages of the retrieval process.

Tagging for visibility: @logan-markewich

raghavdixit99 commented 4 months ago

@caes27, a lancedb search, table.search().limit(x), will return the correct result, as that calls our OSS API, which is a simple vector search and has been tested without any issues.

Additionally, I locally tested it via len(index.vector_store._table.search().where(None).limit(None).to_pandas()) and got the entire table(768 nodes) which is the correct result, you can refer to our API reference for more details - https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.limit

Perhaps your table has not ingested all the data or your uri needs a refresh (rm -rf /your_lancedb_path).

As for the final retrieval results coming back smaller than expected, I have already covered that in my comment and tagged Logan. We should wait for his response, as it seems like a parsing problem in the base retriever class.

Thanks
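
The ingestion hypothesis above can be checked directly by comparing the number of rows stored in the table with the number of nodes the retriever returns. This is a minimal sketch: compare_counts is a hypothetical helper, and it assumes the table object has a count_rows() method (as LanceDB tables do) and the retriever has a retrieve() method.

```python
from typing import Any, Dict


def compare_counts(table: Any, retriever: Any, question: str) -> Dict[str, int]:
    """Compare rows stored in the table with nodes returned by the retriever.

    Assumptions: `table.count_rows()` returns the stored row count and
    `retriever.retrieve(question)` returns a list of nodes.
    """
    stored = table.count_rows()
    retrieved = len(retriever.retrieve(question))
    return {"stored": stored, "retrieved": retrieved}


# Usage sketch with the objects from this thread:
#   table = lancedb.connect("lancedb_RL_test_2").open_table("docs_8.1")
#   retriever = index.as_retriever(similarity_top_k=2500)
#   print(compare_counts(table, retriever, "test"))
```

If stored is already smaller than the number of nodes that were ingested, the problem is on the write path rather than in the query.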

caes27 commented 4 months ago

Hey @raghavdixit99,

I believe you when you say the table.search().limit(x) method works lol

I have refreshed the uri multiple times and still see the same issue. Maybe it's a matter of how nodes are being ingested into the lancedb table when you do this:

vector_store = LanceDBVectorStore(uri="./lancedb_RL_test_3", table_name="docs_8.1", query_type="hybrid")
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)

I can't see anywhere else where it can go wrong.

If you have time, maybe you can try it on your end by populating the table with 2000+ nodes and see if you get the same issue?

Thank you!

caes27 commented 4 months ago

Did more digging. As I was populating the table little by little, instead of sending it 25000+ nodes at once, I realized something.

Suppose my table has 500 nodes in it currently and I want to add 300 more nodes to the table. I run the following code:

vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid')
docstore = SimpleDocumentStore()
docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)

After this is done, there should be 800 nodes in the lancedb table, but after I execute the following code:

db3 = lancedb.connect("lancedb_TEST")
table3 = db3.open_table("docs")
vector_store3 = LanceDBVectorStore.from_table(table3)
index3 = VectorStoreIndex.from_vector_store(vector_store3)
index3.insert_nodes(all_leaf_nodes)
retriever3 = index3.as_retriever(similarity_top_k = 1500)
nodes3 = retriever3.retrieve(question)

nodes3 is of length 300, which were the nodes I just added. It ignores the 500 nodes that were in the lancedb table previously.

Is this not the correct way to add nodes to an existing lancedb table? I appreciate any help, thank you!

raghavdixit99 commented 4 months ago

Hi @caes27 Thanks for the update.

Since you are trying to ingest data iteratively, you should try changing the mode to "append". By default, the table overwrites existing data, which could be the reason for this behavior.

vector_store = LanceDBVectorStore(uri='lancedb_TEST', table_name='docs', query_type='hybrid', mode="append")

caes27 commented 4 months ago

Hello @raghavdixit99,

I think I might have found the issue that was causing problems.

First, I noticed some faulty logic in the LanceDBVectorStore's "add" method, and fixed it myself. At the same time, I thought about upgrading the package and this also fixed it lol.

I also tried your solution yesterday, and it works if the table already exists and has some data in it. However, it throws an error when the table is empty. So that is a workaround, but this fix in the "add" method seemed to solve ingesting data into a fresh table:

Previous:

if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode=self.mode)
    else:
        self._table.add(data)

After:

if self._table is None:
    self._table = self._connection.create_table(
        self._table_name, data, mode=self.mode
    )
else:
    if self.api_key is None:
        self._table.add(data, mode="append")
    else:
        self._table.add(data)

From what I saw, the data was ingested in batches, and because the store was in "overwrite" mode, each new batch completely wiped the previous batch's data, and so on.

The other problem with LlamaIndex losing some nodes in the retrieval process still persists, so I'm still waiting.

Thanks again Raghav for your help throughout this whole thread.

raghavdixit99 commented 4 months ago

Hi @caes27, that is not faulty logic. You are hardcoding the mode argument, whereas we take it from the user's input. Please follow the usage from my last comment; for the rest, we are waiting on Logan's response.

spearki commented 2 months ago

that is not faulty logic, you are hardcoding the mode argument and we have added that as per user’s requirement/ input

@raghavdixit99 I think the problem @caes27 is pointing out is that "append" is not a valid mode for create_table (I see the below error when the table does not exist yet).

.venv/lib/python3.11/site-packages/lancedb/db.py", line 414, in create_table
    raise ValueError("mode must be either 'create' or 'overwrite'")

The llama index code is using the mode parameter for both create_table and table.add, but the values LanceDB expects for each are different. For create_table, valid modes are "create" or "overwrite", whereas for table.add, the mode must be "overwrite" or "append". This works OK for "overwrite" since the modes overlap, but it doesn't work for "append".
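
The mismatch described above could be handled by translating the single store-level mode into a value each call accepts. The sketch below is one possible mapping, not the fix adopted by the maintainers; create_table_mode and add_mode are hypothetical helper names, and the valid values are the ones quoted in this comment.

```python
def create_table_mode(mode: str) -> str:
    """Map the store-level mode to a value create_table accepts."""
    if mode in ("create", "overwrite"):
        return mode
    if mode == "append":
        # Nothing exists to append to yet, so create the table instead.
        return "create"
    raise ValueError(f"Unsupported mode: {mode!r}")


def add_mode(mode: str) -> str:
    """Map the store-level mode to a value table.add accepts."""
    if mode in ("append", "overwrite"):
        return mode
    if mode == "create":
        # The table already exists at this point, so add rows to it.
        return "append"
    raise ValueError(f"Unsupported mode: {mode!r}")


# Usage sketch inside the add method:
#   create_table(name, data, mode=create_table_mode(self.mode))
#   table.add(data, mode=add_mode(self.mode))
```

Note that this mapping only removes the ValueError for "append" on a fresh table. It passes "overwrite" through to table.add unchanged, so it does not address batches replacing each other in "overwrite" mode.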

manfredwang093 commented 2 months ago

docstore.add_documents(all_nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store, docstore=docstore)
index = VectorStoreIndex(nodes=all_leaf_nodes, storage_context=storage_context)

Hi @caes27, in your example I saw all_nodes passed to the docstore and all_leaf_nodes passed to VectorStoreIndex. Is this intended, or could this be the reason for your issue?

manfredwang093 commented 2 months ago

Hi @logan-markewich @raghavdixit99 @spearki, I can confirm this is a bug; I ran into the same issue when first creating a vector DB from scratch. What happens is that self.mode defaults to overwrite during initialization. https://github.com/run-llama/llama_index/blob/f7676375ea8d80ce10b92d4e6de73b1bcb77cbc9/llama-index-integrations/vector_stores/llama-index-vector-stores-lancedb/llama_index/vector_stores/lancedb/base.py#L354C1-L355C1

But since the data is ingested in batches and each batch overwrites the previous one, the vector DB ends up holding only the last batch, i.e. `input_record_size % insert_batch_size` records (or one full batch when the size divides evenly).

https://github.com/run-llama/llama_index/blob/f7676375ea8d80ce10b92d4e6de73b1bcb77cbc9/llama-index-core/llama_index/core/indices/vector_store/base.py#L74

https://github.com/run-llama/llama_index/blob/f7676375ea8d80ce10b92d4e6de73b1bcb77cbc9/llama-index-core/llama_index/core/indices/vector_store/base.py#L232
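
The batch-overwrite effect described above can be simulated without LanceDB. In this minimal sketch, rows_after_batched_overwrite is a hypothetical helper, a plain list stands in for the table, and the batch size of 2048 is an assumption for illustration (the thread does not state which insert_batch_size was in use).

```python
from typing import List


def rows_after_batched_overwrite(total_nodes: int, insert_batch_size: int) -> int:
    """Simulate inserting nodes in batches where every insert overwrites."""
    table: List[int] = []
    for start in range(0, total_nodes, insert_batch_size):
        stop = min(start + insert_batch_size, total_nodes)
        # "overwrite" mode: the new batch replaces the whole table.
        table = list(range(start, stop))
    return len(table)


# 3558 leaf nodes with an assumed batch size of 2048.
print(rows_after_batched_overwrite(3558, 2048))  # 1510
```

Under that assumed batch size, the result equals the 1510 rows reported earlier in this thread for the 3558-node table, which is consistent with only the last batch surviving.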

Could you kindly provide a patch update to fix this issue? @caes27's solution solved it for me.

logan-markewich commented 1 month ago

It would be great if @caes27 or @manfredwang093 can open a PR for this, and maybe include a unit test :) Tbh there have been several updates in lancedb since this issue was opened, I'm not even sure if this is an issue still