langchain-ai / langchain

retriever.as_retriever() function cannot retrieve data based on filter from azure cognitive search but #19885

Closed Farid-Ullah closed 2 months ago

Farid-Ullah commented 6 months ago

Checked other resources

Example Code

Here is the code that I use. I have indexed data in Azure Cognitive Search, and each chunk has a searchable metadata field named location. If I use acs.as_retriever() with a filter, it also retrieves data from other locations. In the output below, I print the location metadata of each retrieved document.

However, if I use acs.similarity_search() and pass the filter to it, it retrieves only the data for that location, with no mix of locations.

acs = acs_search("testindex")
retriever = acs.as_retriever(search_kwargs={'filter': {'location':'US'},
                                            'k': 5})

def format_docs(docs):
    for i in docs:
        print(i.metadata["location"])
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is hr policy about leave")

OUTPUT

US
PK
MY
US
MY
'The HR policy about leave at xyz includes standard paid leave for full-time employees after 90 days of continuous employment. This includes Annual Leave (AL) of 14 workdays....

Using acs.similarity_search():

res = acs.similarity_search(
    query="what is the hr policy for anual leave", k=4, search_type="hybrid", filters="location eq 'US'"
)
res

OUTPUT:

[Document(page_content='Leave taken under this policy does...', metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='You may use available vacation, pers...', metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content="Failure to Return to Work If you fa...", metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='To request leave under this policy, ...', metadata={'source': '2023-us.pdf', 'location': 'US'})]

As you can see, this function returns exactly the filtered data, with no mixed locations.

What would be the solution? We use the first function inside the chain, and there we are unable to get filtered data.

Error Message and Stack Trace (if applicable)

Inside langchain_core > vectorstores.py I placed this print statement. It shows the filter being passed, but the filter still did not work:

def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.search_type == "similarity":
            print("===filter=======>\n",self.search_kwargs,"\n=============")
            docs = self.vectorstore.similarity_search(query, **self.search_kwargs)

OUTPUT:

===filter=======>
 {'filter': {'location': 'US'}, 'k': 5} 
=============

We are unable to get filtered data when using as_retriever() inside the chain. The documents it returns are shown in the first code output.
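One difference between the two calls in this report is the keyword itself: the retriever is configured with filter (a dict), while the working similarity_search() call uses filters (a string such as "location eq 'US'"). The sketch below is only an illustration of how that mismatch could produce mixed results. It assumes the vector store reads the filters keyword and silently ignores unknown ones, which this thread does not confirm. FakeStore, DOCS, and retrieve are hypothetical stand-ins, not LangChain or Azure classes.

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory documents standing in for the indexed chunks.
DOCS: List[Dict[str, str]] = [
    {"text": "US leave policy", "location": "US"},
    {"text": "PK leave policy", "location": "PK"},
    {"text": "MY leave policy", "location": "MY"},
]


class FakeStore:
    """Hypothetical stand-in for a vector store; NOT the real Azure class."""

    def similarity_search(
        self, query: str, k: int = 4, **kwargs: Any
    ) -> List[Dict[str, str]]:
        # Assumption: only the `filters` keyword is read. Any other key,
        # such as `filter`, is accepted via **kwargs and silently ignored.
        filters: Optional[str] = kwargs.get("filters")
        docs = DOCS
        if filters is not None:
            # Toy parser that handles only "<field> eq '<value>'".
            field, _, value = filters.partition(" eq ")
            docs = [d for d in docs if d[field] == value.strip("'")]
        return docs[:k]


def retrieve(
    store: FakeStore, query: str, search_kwargs: Dict[str, Any]
) -> List[Dict[str, str]]:
    # Mirrors the call shown in the report:
    # self.vectorstore.similarity_search(query, **self.search_kwargs)
    return store.similarity_search(query, **search_kwargs)


store = FakeStore()
# Same shape as the retriever configuration in the report.
mixed = retrieve(store, "leave policy", {"filter": {"location": "US"}, "k": 5})
# Same keyword and string form as the working similarity_search() call.
only_us = retrieve(store, "leave policy", {"filters": "location eq 'US'", "k": 5})

print([d["location"] for d in mixed])
print([d["location"] for d in only_us])
```

If that assumption holds for the versions listed in this report, passing search_kwargs={'filters': "location eq 'US'", 'k': 5} to as_retriever() may be worth trying. This has not been verified against those versions.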

Description

I use the versions below:

langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6

System Info

aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asgiref==3.7.2
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
asyncio-redis==0.16.0
attrs==23.2.0
azure-common==1.1.28
azure-core==1.30.0
azure-identity==1.15.0
azure-monitor-opentelemetry-exporter==1.0.0b22
azure-search-documents==11.4.0
azure-storage-blob==12.19.1
Babel==2.14.0
backoff==2.2.1
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cohere==4.56
coloredlogs==15.0.1
comm==0.2.1
contourpy==1.2.0
cryptography==42.0.4
cycler==0.12.1
dataclasses-json==0.6.4
debugpy==1.8.1
decorator==5.1.1
deepdiff==6.7.1
defusedxml==0.7.1
Deprecated==1.2.14
distro==1.9.0
effdet==0.4.1
emoji==2.10.1
et-xmlfile==1.1.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.109.2
fastavro==1.9.4
fastjsonschema==2.19.1
filelock==3.13.1
filetype==1.2.0
fixedint==0.1.6
flatbuffers==24.3.6
fonttools==4.49.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.2.0
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
iopath==0.1.10
ipykernel==6.29.2
ipython==8.22.1
ipywidgets==8.1.2
isodate==0.6.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
json5==0.9.24
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.4
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.13.0
jupyter_server_terminals==0.5.3
jupyterlab==4.1.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.4
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6
langchainhub==0.1.15
langdetect==1.0.9
langsmith==0.1.5
layoutparser==0.3.4
lxml==5.1.0
MarkupSafe==2.1.5
marshmallow==3.20.2
matplotlib==3.8.3
matplotlib-inline==0.1.6
mistune==3.0.2
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.1.0
msrest==0.7.1
multidict==6.0.5
mypy-extensions==1.0.0
nbclient==0.10.0
nbconvert==7.16.3
nbformat==5.10.3
nest-asyncio==1.6.0
networkx==3.2.1
nltk==3.8.1
notebook==7.1.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.15.0
onnxruntime==1.15.1
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
opentelemetry-api==1.22.0
opentelemetry-instrumentation==0.43b0
opentelemetry-instrumentation-asgi==0.43b0
opentelemetry-instrumentation-fastapi==0.43b0
opentelemetry-sdk==1.22.0
opentelemetry-semantic-conventions==0.43b0
opentelemetry-util-http==0.43b0
ordered-set==4.1.0
overrides==7.7.0
packaging==23.2
pandas==2.2.1
pandocfilters==1.5.1
parso==0.8.3
pdf2image==1.17.0
pdfminer.six==20221105
pdfplumber==0.10.4
pexpect==4.9.0
pikepdf==8.13.0
pillow==10.2.0
pillow_heif==0.15.0
platformdirs==4.2.0
portalocker==2.8.2
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycocotools==2.0.7
pycparser==2.21
pydantic==2.6.1
pydantic-settings==2.2.0
pydantic_core==2.16.2
Pygments==2.17.2
PyJWT==2.8.0
pymssql==2.2.11
pyparsing==3.1.2
pypdf==4.1.0
pypdfium2==4.27.0
pytesseract==0.3.10
python-dateutil==2.8.2
python-docx==1.1.0
python-dotenv==1.0.1
python-iso639==2024.2.7
python-json-logger==2.0.7
python-magic==0.4.27
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
rapidfuzz==3.6.2
redis==5.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
SQLAlchemy==2.0.27
stack-data==0.6.3
starlette==0.36.3
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
terminado==0.18.1
tiktoken==0.6.0
timm==0.9.16
tinycss2==1.2.1
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchvision==0.17.1
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.38.2
triton==2.2.0
types-python-dateutil==2.9.0.20240316
types-requests==2.31.0.20240311
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2024.1
unstructured==0.12.4
unstructured-client==0.21.1
unstructured-inference==0.7.23
unstructured.pytesseract==0.3.12
uri-template==1.3.0
urllib3==2.2.1
uvicorn==0.27.1
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
widgetsnbextension==4.0.10
wrapt==1.16.0
xlrd==2.0.1
yarl==1.9.4
zipp==3.17.0

liugddx commented 6 months ago

Let me see

Farid-Ullah commented 6 months ago

Hi @liugddx, have you checked the issue? Thanks.

Farid-Ullah commented 6 months ago

Hi @jarib @zeke, I hope you are all doing well.

Could you help me find a solution to this problem? If it does not work in a chain, I will implement it manually, step by step, to achieve this functionality.

Your help would be appreciated. Thank you.

sbusso commented 6 months ago

@Farid-Ullah, no random tagging, please.

my23701 commented 1 month ago

Hi @Farid-Ullah, did you find a solution to this problem? I am facing a similar problem in my RAG as well.