neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Embeddings.search leads to segfault error #813

Closed: Pringled closed this 2 hours ago

Pringled commented 5 hours ago

Hi! When running one of the examples, I ran into an issue.

Issue

The following code crashes with a segfault error when search is called. The run also prints this warning:

UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Code to reproduce:
from txtai import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")

data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]

# Index the list of text
embeddings.index(data)

print(f"{'Query':20} Best Match")
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war",
              "wildlife", "asia", "lucky", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print(f"{query:20} {data[uid]}")

Environment info

Running on macOS (M3), Python 3.10.14. Packages in the venv:

aiohappyeyeballs==2.4.3
aiohttp==3.11.4
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
async-timeout==5.0.1
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
diskcache==5.6.3
distro==1.9.0
exceptiongroup==1.2.2
faiss-cpu==1.9.0
fasteners==0.19
fasttext==0.9.3
filelock==3.16.1
frozenlist==1.5.0
fsspec==2024.10.0
h11==0.14.0
httpcore==1.0.7
httpx==0.27.2
huggingface-hub==0.26.2
idna==3.10
importlib_metadata==8.5.0
Jinja2==3.1.4
jiter==0.7.1
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
litellm==1.52.10
llama_cpp_python==0.3.2
lz4==4.3.3
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
model2vec==0.3.2
mpmath==1.3.0
msgpack==1.1.0
multidict==6.1.0
networkx==3.4.2
numpy==2.1.3
openai==1.54.4
packaging==24.2
pillow==11.0.0
propcache==0.2.0
pybind11==2.13.6
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pymagnitude-lite==0.1.143
python-dotenv==1.0.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.21.0
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.3.1
skops==0.10.0
sniffio==1.3.1
sympy==1.13.1
tabulate==0.9.0
threadpoolctl==3.5.0
tiktoken==0.8.0
tokenizers==0.20.3
torch==2.5.1
tqdm==4.67.0
transformers==4.46.3
txtai==8.0.0
typing_extensions==4.12.2
urllib3==2.2.3
xxhash==3.5.0
yarl==1.17.2
zipp==3.21.0
davidmezzetti commented 2 hours ago

Hello, thank you for the detailed report.

This is typically due to a known issue between Faiss and macOS (https://github.com/kyamagu/faiss-wheels/issues/100).

The usual mitigations are:

Issue: Segmentation faults and similar errors on macOS
Solution: Set the following environment parameters.

Source: https://neuml.github.io/txtai/faq/

There is also this: https://github.com/kyamagu/faiss-wheels/issues/73#issuecomment-1913995571

export KMP_DUPLICATE_LIB_OK=TRUE

It would be great to have a programmatic solution, as I'm sure there are plenty of macOS users who encounter this error and just move on to another library.
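
In the meantime, a minimal sketch of the per-process version of that workaround: set the same environment variable from Python before txtai (and therefore Faiss) is imported, so it is in place when the OpenMP runtime loads. This is not an official txtai fix, just the programmatic equivalent of the export above.

# Set the variable before importing txtai so it applies to this process only
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

from txtai import Embeddings

embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")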

Pringled commented 2 hours ago

Hi @davidmezzetti, thanks for the detailed reply! The other backends indeed seem to work fine. An alternative solution would be to change the default for the index method (e.g. hnsw), but that's not as nice since it would introduce more base dependencies for txtai. For now I'll just use a different backend, since I would be using hnsw from Faiss anyway. Thanks!
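
For anyone landing here, this is roughly what switching the ANN backend looks like via the Embeddings config. A sketch assuming the hnswlib dependency is installed and reusing the model path from the example above:

from txtai import Embeddings

# Same example as above, but requesting the hnswlib ANN backend instead of
# the default Faiss backend (requires hnswlib to be installed)
embeddings = Embeddings(
    path="sentence-transformers/nli-mpnet-base-v2",
    backend="hnsw"
)

embeddings.index(["US tops 5 million confirmed virus cases"])
print(embeddings.search("public health story", 1))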

davidmezzetti commented 1 hour ago

In the past, I had setup.py conditionally install hnswlib for macOS/Windows and faiss for Linux as the defaults. But that became confusing, as the results differed depending on the OS.

I've been hoping the upstream library would find a solution, but I've been holding my breath for a while :smile:
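
For context, that kind of OS-conditional default is usually expressed with PEP 508 environment markers in the dependency list. A rough sketch of the idea, illustration only (the package name is a placeholder, not txtai's actual setup.py):

# OS-conditional default ANN dependencies via environment markers
from setuptools import find_packages, setup

setup(
    name="example-package",  # placeholder name for illustration
    version="0.0.1",
    packages=find_packages(),
    install_requires=[
        'faiss-cpu; sys_platform == "linux"',
        'hnswlib; sys_platform == "darwin" or sys_platform == "win32"',
    ],
)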