langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

FAISS vectorstore filter not working #17633

Closed vorel99 closed 1 month ago

vorel99 commented 7 months ago

Checked other resources

Example Code

I tried to run the code from the LangChain docs that calls similarity search with a filter, but the results differ from what the documentation shows.

from langchain_community.vectorstores import FAISS
from langchain.schema import Document

list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
db = FAISS.from_documents(list_of_documents, embeddings)  # embeddings: an initialized Embeddings instance

results_with_scores = db.similarity_search_with_score("foo", filter=dict(page=1))
# Or with a callable:
# results_with_scores = db.similarity_search_with_score("foo", filter=lambda d: d["page"] == 1)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")

My results are the same as the results without filtering:

Content: foo, Metadata: {'page': 1}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 2}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 3}, Score: 5.159960813797904e-15
Content: foo, Metadata: {'page': 4}, Score: 5.159960813797904e-15
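Until the regression is fixed, one workaround is to over-fetch (e.g. with a larger `k`) and filter client-side on metadata. The sketch below is self-contained: `Doc` is a stand-in for LangChain's `Document`, and the `results` list simulates what `similarity_search_with_score` returns; `post_filter` is my own helper, not a LangChain API.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for langchain's Document, just for a self-contained demo."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def post_filter(results, metadata_filter):
    """Keep only (doc, score) pairs whose metadata matches every key/value."""
    return [
        (doc, score)
        for doc, score in results
        if all(doc.metadata.get(k) == v for k, v in metadata_filter.items())
    ]

# Simulated output of similarity_search_with_score with a large k:
results = [
    (Doc("foo", {"page": 1}), 5.16e-15),
    (Doc("foo", {"page": 2}), 5.16e-15),
    (Doc("foo", {"page": 3}), 5.16e-15),
]
print(post_filter(results, {"page": 1}))  # only the page-1 hit remains
```

Note that over-fetching changes the semantics slightly: you must fetch enough candidates that the matching documents appear among them before filtering.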

Error Message and Stack Trace (if applicable)

No response

Description

I'm migrating from langchain==0.0.349 to the new langchain 0.1.x; filtering worked just fine in version 0.0.349.

System Info

faiss-cpu==1.7.4
langchain==0.1.6
langchain-community==0.0.19
langchain-core==0.1.23
langchain-openai==0.0.6

Windows 10

dosubot[bot] commented 7 months ago

🤖

Thank you for providing detailed information about the issue you're experiencing with the FAISS vectorstore filter in LangChain. It's clear that the filter isn't working as expected in your case.

To better understand the problem and find a solution, I need a bit more information. Could you please provide the following:

  1. The version of Python you're using.
  2. The exact code for how you're generating the embeddings used in FAISS.from_documents(list_of_documents, embeddings).
  3. Any changes you made to the code from the documentation, if any.
  4. The output of pip freeze command to see the exact versions of all the installed packages.

This additional information will help me in diagnosing the issue more accurately. Thank you for your cooperation.

vorel99 commented 7 months ago

Using Python 3.11.

Exact code:

from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    client=None,
    azure_endpoint=config.openai.api_base,
    api_key=config.openai.api_key,
    openai_api_type=config.openai.api_type,
    azure_deployment=config.openai.embedding_deployment_id,
    chunk_size=1,
)

list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
]
db = FAISS.from_documents(list_of_documents, embeddings)

results_with_scores = db.similarity_search_with_score("foo", filter=dict(page=1))
# Or with a callable:
# results_with_scores = db.similarity_search_with_score("foo", filter=lambda d: d["page"] == 1)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
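For anyone who wants to reproduce this without Azure OpenAI credentials, a deterministic stub can stand in for the embeddings object. This assumes the FAISS wrapper only calls `embed_documents` and `embed_query` (worth verifying against your langchain-community version); `StubEmbeddings` is my own illustrative class, not part of LangChain.

```python
import hashlib

class StubEmbeddings:
    """Deterministic fake embeddings: each text maps to a fixed 8-dim
    vector derived from its MD5 digest, so runs are reproducible."""

    def _vec(self, text):
        digest = hashlib.md5(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[:8]]

    def embed_documents(self, texts):
        return [self._vec(t) for t in texts]

    def embed_query(self, text):
        return self._vec(text)

emb = StubEmbeddings()
# Pass `emb` as the embeddings argument to FAISS.from_documents in the
# snippet above to exercise the filter behavior offline.
```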

No changes were made to the code.

The first version of langchain-community with this bug is 0.0.17; langchain-community 0.0.16 works as expected.

pip freeze aiohttp==3.8.5 aiosignal==1.3.1 annotated-types==0.5.0 anyio==3.7.1 asttokens==2.4.0 async-timeout==4.0.3 attrs==23.1.0 azure-core==1.29.7 azure-functions==1.18.0 azure-storage-blob==12.19.0 backcall==0.2.0 bert-score==0.3.13 black==23.9.1 blis==0.7.11 build==1.0.3 catalogue==2.0.10 certifi==2023.7.22 cffi==1.16.0 cfgv==3.4.0 charset-normalizer==3.3.0 click==8.1.7 cloudpathlib==0.15.1 colorama==0.4.6 comm==0.1.4 confection==0.1.3 contourpy==1.1.1 cryptography==41.0.4 cycler==0.12.0 cymem==2.0.8 dataclasses-json==0.6.1 debugpy==1.8.0 decorator==5.1.1 Deprecated==1.2.14 distlib==0.3.7 distro==1.8.0 et-xmlfile==1.1.0 executing==2.0.0 faiss-cpu==1.7.4 filelock==3.12.4 fonttools==4.43.0 frozenlist==1.4.0 fsspec==2023.9.2 greenlet==3.0.0 h11==0.14.0 httpcore==1.0.2 httpx==0.25.2 huggingface-hub==0.17.3 identify==2.5.30 idna==3.4 iniconfig==2.0.0 ipykernel==6.25.2 ipython==8.16.1 isodate==0.6.1 jedi==0.19.1 Jinja2==3.1.2 joblib==1.3.2 jsonpatch==1.33 jsonpointer==2.4 jupyter_client==8.3.1 jupyter_core==5.3.2 kiwisolver==1.4.5 langchain==0.1.6 langchain-community==0.0.19 langchain-core==0.1.23 langchain-openai==0.0.6 langcodes==3.3.0 langsmith==0.0.87 llm-evaluator==1.1.4 lxml==4.9.3 MarkupSafe==2.1.3 marshmallow==3.20.1 matplotlib==3.8.0 matplotlib-inline==0.1.6 mpmath==1.3.0 multidict==6.0.4 murmurhash==1.0.10 mypy-extensions==1.0.0 nest-asyncio==1.5.8 networkx==3.1 nltk==3.8.1 nodeenv==1.8.0 numpy==1.26.0 openai==1.12.0 openpyxl==3.1.2 outcome==1.2.0 packaging==23.2 pandas==2.1.1 parso==0.8.3 pathspec==0.11.2 pathy==0.10.2 pickleshare==0.7.5 Pillow==10.0.1 pip-system-certs==4.0 pip-tools==7.3.0 platformdirs==3.11.0 pluggy==1.3.0 pre-commit==3.4.0 preshed==3.0.9 prompt-toolkit==3.0.39 psutil==5.9.5 pure-eval==0.2.2 pycparser==2.21 pydantic==2.4.2 pydantic-settings==2.1.0 pydantic_core==2.10.1 Pygments==2.16.1 pymssql==2.2.8 PyMuPDF==1.23.6 PyMuPDFb==1.23.6 pyparsing==3.1.1 pyproject_hooks==1.0.0 PySocks==1.7.1 pyspnego==0.10.2 pytest==7.4.2 
pytest-asyncio==0.23.4 python-dateutil==2.8.2 python-docx==0.8.11 python-dotenv==1.0.0 pytz==2023.3.post1 pywin32==306 PyYAML==6.0.1 pyzmq==25.1.1 regex==2023.8.8 requests==2.31.0 requests-ntlm==1.2.0 ruff==0.0.292 safetensors==0.3.3.post1 scikit-learn==1.3.1 scipy==1.11.3 selenium==4.13.0 sentence-transformers==2.2.2 sentencepiece==0.1.99 six==1.16.0 smart-open==6.4.0 sniffio==1.3.0 sortedcontainers==2.4.0 spacy==3.7.0 spacy-legacy==3.0.12 spacy-loggers==1.0.5 spacy-udpipe==1.0.0 SQLAlchemy==2.0.21 srsly==2.4.8 sspilib==0.1.0 stack-data==0.6.3 sympy==1.12 tenacity==8.2.3 thinc==8.2.1 threadpoolctl==3.2.0 tiktoken==0.6.0 tokenizers==0.13.3 tomli==2.0.1 torch==2.0.1 torchvision==0.15.2 tornado==6.3.3 tqdm==4.66.1 traitlets==5.10.1 transformers==4.33.3 trio==0.22.2 trio-websocket==0.11.1 typer==0.7.0 typing-inspect==0.9.0 typing_extensions==4.8.0 tzdata==2023.3 ufal.morphodita==1.11.1.1 ufal.udpipe==1.3.0.1 Unidecode==1.3.7 urllib3==2.0.6 virtualenv==20.24.5 wasabi==1.1.2 wcwidth==0.2.8 weasel==0.3.1 wrapt==1.15.0 wsproto==1.2.0 yarl==1.9.2
BMTCompany2 commented 5 months ago

Has this been fixed?

I'm encountering the same problem. Running the same version of langchain

vorel99 commented 5 months ago

> Has this been fixed?
>
> I'm encountering the same problem. Running the same version of langchain

Hi, I tried the same example with langchain==0.1.17 and it's fixed.
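Based on the versions reported in this thread (broken starting with langchain-community 0.0.17, confirmed working again with langchain 0.1.17), a small guard like this can fail fast on affected installs. The helper names are illustrative, not a LangChain API:

```python
from importlib import metadata

def version_tuple(v):
    """Parse 'X.Y.Z' into a comparable tuple of ints."""
    return tuple(int(p) for p in v.split(".")[:3])

def filter_fixed(langchain_version):
    """True if langchain is at or past 0.1.17, the version the
    reporter confirmed as fixed in this thread."""
    return version_tuple(langchain_version) >= (0, 1, 17)

try:
    print(filter_fixed(metadata.version("langchain")))
except metadata.PackageNotFoundError:
    print("langchain is not installed")
```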