run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Unable to change MilvusVectorStore uri parameter to a local Milvus path #17046

Open jayant-yadav opened 23 hours ago

jayant-yadav commented 23 hours ago

Bug Description

MilvusVectorStore takes a uri parameter. When it is given a local Milvus path (which works with Milvus Lite), the stored uri does not change and sticks to the default ./milvus_llamaindex.db.

Being unable to change the uri has implications for other parts of the code, e.g. when the vector store needs to be fetched again in a different session after all the docs have been inserted.

Version

v0.12.1

Steps to Reproduce

Consider the following code from the Milvus demo example, with the only change being that the LLM and embedding model come from Hugging Face instead of OpenAI:

%pip install llama-index-vector-stores-milvus
%pip install llama-index
%pip install "pymilvus>=2.4.2"
%pip install llama-index-llms-huggingface-api
%pip install "huggingface_hub[inference]"
%pip install llama-index-embeddings-huggingface

! mkdir -p 'data/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from llama_index.core import Settings
from google.colab import userdata

# set embed model for indexing and querying
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.embed_model = embed_model

llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    token=userdata.get('HF_TOKEN'),
)
Settings.llm = llm

from llama_index.core import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

print("Document ID:", documents[0].doc_id)

# Create an index over the documents
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri="./milvus_something.db", dim=384, overwrite=True
)
print(vector_store.uri)
# prints ./milvus_llamaindex.db instead of ./milvus_something.db

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

query_engine = index.as_query_engine()
res = query_engine.query("What did the author learn?")
print(res)

Relevant Logs/Tracebacks

--2024-11-23 11:49:00--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham_essay.txt’

data/paul_graham_es 100%[===================>]  73.28K  --.-KB/s    in 0.002s  

2024-11-23 11:49:00 (41.1 MB/s) - ‘data/paul_graham_essay.txt’ saved [75042/75042]

--2024-11-23 11:49:00--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/uber_2021.pdf’

data/uber_2021.pdf  100%[===================>]   1.79M  --.-KB/s    in 0.01s   

2024-11-23 11:49:01 (188 MB/s) - ‘data/uber_2021.pdf’ saved [1880483/1880483]

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 3387021e2381447792762a5d40b37cea
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: llamacollection
Document ID: 83b63d52-3e07-40e0-94ed-d9527ef70414
DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: llamacollection
./milvus_llamaindex.db
1. AI, as practiced at the time, was a hoax. 2. There was an unbridgeable gap between what AI programs could do and actually understanding natural language. 3. Lisp was interesting for its own sake and not just for its association with AI. 4. The author could salvage Lisp from the wreckage of his plans. 5. The author could write essays and publish them online, which was a new medium. 6. Working on things that weren't prestigious could be a sign of something real to be discovered and the right kind of motives. 7. The author could learn from his experiences and write about them in essays. 8. The author could use the web to publish his essays and reach a wider audience. 9. The author could work on different projects, such as spam filters, painting, and cooking. 10. The author could use his experiences to teach others and share his knowledge. 11. The author could use his writing to express himself and share his thoughts with others. 12. The author could use his experiences to learn and grow as a person. 13. The author could use his writing to inspire others and share his passion with them. 14. The author could use his experiences to create
dosubot[bot] commented 23 hours ago

The issue you're experiencing with the MilvusVectorStore class not updating the uri parameter to a local Milvus path seems to be a bug. The uri parameter should allow you to specify a different path, such as ./milvus_something.db, but it appears to default back to ./milvus_llamaindex.db instead.

The expected behavior is that the uri parameter should be set to the specified path, allowing you to use a local Milvus Lite database file. This is crucial for ensuring that the vector store can be accessed consistently across different sessions.

To address this, you might want to check if there are any recent updates or patches for the MilvusVectorStore class in the repository that could resolve this issue. If the problem persists, it may require a code change or a workaround until an official fix is provided.

To continue talking to Dosu, mention @dosu.

logan-markewich commented 21 hours ago

Looking at the latest source code, it's directly passing the uri to the milvus client https://github.com/run-llama/llama_index/blob/b452e003c2fc0684e650022a720bde193162e37b/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L273

jayant-yadav commented 19 hours ago

Is there a motivation behind using self.client in some places and self._milvusclient in others? It should not matter, but I am just curious:

@property
def client(self) -> Any:
    """Get client."""
    return self._milvusclient

https://github.com/run-llama/llama_index/blob/b452e003c2fc0684e650022a720bde193162e37b/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L273 https://github.com/run-llama/llama_index/blob/b452e003c2fc0684e650022a720bde193162e37b/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L283
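As an aside on the self.client vs self._milvusclient question: this looks like the common Pydantic idiom where the raw client lives in a private attribute and a read-only property exposes it, so the two refer to the same object. A minimal sketch under that assumption (ClientHolder and the stand-in client are hypothetical, not the real class):

```python
from typing import Any
from pydantic import BaseModel, PrivateAttr

class ClientHolder(BaseModel):
    """Hypothetical sketch of the private-attribute + property idiom."""
    # Private attributes are excluded from pydantic validation/serialization.
    _milvusclient: Any = PrivateAttr(default=None)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._milvusclient = object()  # stand-in for a real MilvusClient

    @property
    def client(self) -> Any:
        """Get client."""
        return self._milvusclient

holder = ClientHolder()
print(holder.client is holder._milvusclient)
# -> True
```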

I would also like to add that the bug I encountered, where a wrong uri is returned, does not mean that ./milvus_something.db was not created. It was. Rather, the vector_store was reporting the wrong uri (./milvus_llamaindex.db in this case) instead.

logan-markewich commented 18 hours ago

If it creates the proper db, it's a pretty mild bug I guess?

Could be related to setting the field on the class directly as a string vs. a Pydantic field object. Will take a look.
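The symptom is consistent with a constructor that consumes the user-supplied uri for a side effect but never forwards it into the Pydantic model. A hedged sketch of that suspected pattern (StoreSketch is a hypothetical stand-in, not the real MilvusVectorStore):

```python
from pydantic import BaseModel

class StoreSketch(BaseModel):
    """Hypothetical stand-in illustrating the suspected bug pattern."""
    uri: str = "./milvus_llamaindex.db"

    def __init__(self, uri: str = "./milvus_llamaindex.db", **kwargs):
        # The user-supplied uri is consumed here (e.g. to build the client)
        # but never forwarded to the pydantic model, so the field silently
        # keeps its class-level default.
        super().__init__(**kwargs)  # note: uri is not passed through
        # client = MilvusClient(uri=uri)  # the side effect would get the real path

print(StoreSketch(uri="./milvus_something.db").uri)
# -> ./milvus_llamaindex.db
```

This would also explain why the correct .db file is created (the client saw the real path) while vector_store.uri reports the default.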

jayant-yadav commented 17 hours ago

Note that collection_name, even though it is initialized the same way as uri in __init__, can be overridden by the user input:

https://github.com/run-llama/llama_index/blob/b452e003c2fc0684e650022a720bde193162e37b/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L188C5-L188C20.

Output of the test code:

stores_text=True is_embedding_query=True stores_node=True uri='./milvus_llamaindex.db' token='' collection_name='some_collection' dim=384 embedding_field='embedding' doc_id_field='doc_id' similarity_metric='IP' consistency_level='Session' overwrite=True text_key=None output_fields=[] index_config={} search_config={} collection_properties=None batch_size=100 enable_sparse=False sparse_embedding_field='sparse_embedding' sparse_embedding_function=None hybrid_ranker='RRFRanker' hybrid_ranker_params={} index_management=<IndexManagement.CREATE_IF_NOT_EXISTS: 'create_if_not_exists'> scalar_field_names=None scalar_field_types=None
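For contrast, the output above shows collection_name reflecting the user input while uri does not, which fits the theory that collection_name is forwarded into the model and uri is not. A hedged sketch of the forwarding pattern that would fix uri (FixedSketch is hypothetical):

```python
from pydantic import BaseModel

class FixedSketch(BaseModel):
    """Hypothetical sketch: forwarding constructor arguments into the
    pydantic model makes the fields reflect user input."""
    uri: str = "./milvus_llamaindex.db"
    collection_name: str = "llamacollection"

    def __init__(
        self,
        uri: str = "./milvus_llamaindex.db",
        collection_name: str = "llamacollection",
        **kwargs,
    ):
        # Both values are passed through to the model, so attribute access
        # and repr report what the caller actually supplied.
        super().__init__(uri=uri, collection_name=collection_name, **kwargs)

store = FixedSketch(uri="./milvus_something.db", collection_name="some_collection")
print(store.uri, store.collection_name)
# -> ./milvus_something.db some_collection
```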