neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Unable to save index #419

Closed sridhar-rv closed 1 year ago

sridhar-rv commented 1 year ago

I have the following environment: Python 3.8, txtai, Jupyter notebook, with the default FAISS index backend.

I tried the basic example from the documentation. Everything works except the step that saves the index:

    embeddings.save("index")

It runs for more than an hour and never completes. I have to kill the kernel every time.

Is there a fix for this issue?

Thanks, Sridhar

davidmezzetti commented 1 year ago

Hard to say with these details. When you say basic example, you mean it's saving ~10 records?

sridhar-rv commented 1 year ago

Yes, the example given in the documentation.

davidmezzetti commented 1 year ago

I would try running it as a Python script and see if that is any different. This isn't normal; perhaps it's something related to your environment. Do you have something like SELinux or another security policy that could block writes?

Alternatively, you can run the code on Colab or another cloud notebook system like Kaggle to rule out your environment.
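One way to test the blocked-write theory locally is to time a small write in the directory where the index is saved. The following is a minimal sketch using only the standard library; the helper name `check_writable` is hypothetical and not part of txtai. It will not detect every kind of security policy, only whether a plain file write and sync succeeds and how long it takes.

```python
import os
import tempfile
import time


def check_writable(directory="."):
    """Write and fsync a small probe file in `directory`; return elapsed seconds.

    Hypothetical helper, not part of txtai. Raises OSError (for example
    PermissionError) if the write is refused.
    """
    start = time.perf_counter()
    # The probe file is removed automatically when the context manager exits
    with tempfile.NamedTemporaryFile(dir=directory, prefix="write_probe_") as handle:
        handle.write(b"x" * 1024)
        handle.flush()
        # Force the data to disk so a stalled filesystem shows up as a long delay
        os.fsync(handle.fileno())
    return time.perf_counter() - start
```

Calling `check_writable(".")` from the directory that holds the index should return within a fraction of a second on a healthy filesystem. An exception or a long delay would point at the environment rather than at txtai.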

sridhar-rv commented 1 year ago

OK, sure, I will try this.

What I see is that the index is getting created, but the execution never completes.

sridhar-rv commented 1 year ago

I tried executing the basic example as a Python script.

    from txtai.embeddings import Embeddings

    data = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

    # embeddings is created as shown in Scenario 1 or Scenario 2
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
    embeddings.save("index2")

Scenario 1:

    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

The script ran the documentation example fine and the index was saved. The script completed successfully.

Scenario 2:

    embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True, "objects": True})

The index was saved, but the script never finishes executing.

So the "content": True, "objects": True settings are causing the problem.

davidmezzetti commented 1 year ago

What OS are you running on? Objects shouldn't make a difference since there isn't binary content. There could be an issue with content and the version of SQLite available.
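The SQLite version that matters here is the one Python's `sqlite3` module is linked against, which can differ from the version reported by the `sqlite3` command-line tool. A minimal sketch for checking it, along with a basic in-memory write and read, is shown below. The helper name `sqlite_info` is hypothetical, and the round trip only confirms that SQLite works at all, not that it behaves correctly with txtai.

```python
import sqlite3


def sqlite_info():
    """Return the linked SQLite library version and the result of a tiny round trip."""
    info = {"library": sqlite3.sqlite_version}

    # Use an in-memory database so nothing is written to disk
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, text TEXT)")
        connection.execute("INSERT INTO probe (text) VALUES (?)", ("hello",))
        connection.commit()
        info["roundtrip"] = connection.execute("SELECT text FROM probe").fetchone()[0]
    finally:
        connection.close()

    return info


print(sqlite_info())
```

Running this in the same interpreter (or notebook kernel) that runs txtai shows which SQLite library is actually in use.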

sridhar-rv commented 1 year ago

The OS is Ubuntu 20.04.4 LTS (Focal Fossa). SQLite3 version: 3.38.5, Python 3.8.5, txtai 5.2.0.

davidmezzetti commented 1 year ago

Not sure on this. Nothing seems out of the ordinary, lots of instances running on that platform without issues, including the automated GitHub Actions builds.

The latest version of Python 3.8 is 3.8.16. Did you run apt-get update and apt-get install to bring the packages for your OS up to date? Best guess would be that something is blocking the write, but it's hard to say.
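When a script saves the index but never exits, a stack dump of all threads can show where it is stuck. The sketch below uses the standard library `faulthandler` module for that. The helper name `arm_hang_dump`, the default timeout, and the log file name are all assumptions chosen for illustration, not something from txtai.

```python
import faulthandler


def arm_hang_dump(timeout=120, log_path="hang_traceback.log"):
    """Schedule a traceback dump of all threads if the process is still running
    after `timeout` seconds.

    Hypothetical helper. Returns the open log file, which must stay open
    until the dump happens or the timer is cancelled.
    """
    log = open(log_path, "w")
    # exit=False: only write the tracebacks, do not terminate the process
    faulthandler.dump_traceback_later(timeout, file=log, exit=False)
    return log


# Usage, at the very top of the script that hangs:
#     log = arm_hang_dump(timeout=120)
#     ... build the embeddings, index, save ...
# If the script finishes normally, nothing is written to the log.
```

If the script is still alive after the timeout, `hang_traceback.log` should contain one traceback per thread, which makes it easier to tell whether the process is waiting on a file write, a database call, or a background thread.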

davidmezzetti commented 1 year ago

Closing due to inactivity. Re-open or open a new issue if this still persists.