langchain-ai / langchain


KayAiRetriever: without metadata filtering, wrong results #12008

Closed olgadk7 closed 12 months ago

olgadk7 commented 1 year ago

System Info

Hello! Even though the API mentions a metadata param, it isn't found in the code.

Without metadata filtering, querying "Tell me about the returns of Palantir Technologies Inc." returns docs with 'company_name': 'ETSY INC'.

Thank you

Who can help?

@eyurtsev?

Reproduction

from langchain.retrievers import KayAiRetriever

retriever = KayAiRetriever.create(dataset_id="company", data_types=["10-K"], num_contexts=1)
retriever.get_relevant_documents("Tell me about the returns of Palantir Technologies Inc.?")

Expected behavior

[Document(page_content="Company Name: ETSY INC \n Company Industry: SERVICES-BUSINESS SERVICES, NEC \n Form Title: 10-K 2020-FY \n Form Section: Risk Factors \n Text: Any events causing ... abroad.", metadata={'chunk_type': 'text', 'chunk_years_mentioned': [], 'company_name': 'PALANTIR TECHNOLOGIES INC', 'company_sic_code_description': 'SERVICES-BUSINESS SERVICES, NEC', 'data_source': '10-K', 'data_source_link': 'https://www.sec.gov/Archives/edgar/data/1370637/000137063721000012', 'data_source_publish_date': '2020-01-01T00:00:00Z', 'data_source_uid': '0001370637-21-000012', 'title': 'ETSY INC | 10-K 2020-FY '})]

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for bringing this to our attention. You're correct that the current implementation of the KayAiRetriever does not support metadata filtering. This is because the query method of the KayRetriever client, which is used in the _get_relevant_documents method, does not have a parameter for metadata. It only takes in the query and num_context parameters.

Here's the relevant code snippet from libs/langchain/langchain/retrievers/kay.py:

def _get_relevant_documents(
    self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
    ctxs = self.client.query(query=query, num_context=self.num_contexts)
    docs = []
    for ctx in ctxs:
        page_content = ctx.pop("chunk_embed_text", None)
        if page_content is None:
            continue
        docs.append(Document(page_content=page_content, metadata={**ctx}))
    return docs

As you can see, the query method retrieves documents based on the query string and the number of contexts specified, without considering any metadata. The metadata of each document is simply added to the document after retrieval.

If you want to filter documents by metadata, you would need to add this functionality yourself. One option is adding a metadata parameter to the query method of the KayRetriever client; that requires changes to the Kay.ai Python package, which may not be possible if you do not control that package. The other is filtering the retrieved documents by metadata within the _get_relevant_documents method of the KayAiRetriever class, which needs no change to the Kay.ai package.

We understand the importance of this feature and will consider adding it in future updates. In the meantime, you could implement a workaround by filtering the returned documents based on their metadata after they have been retrieved.
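The post-retrieval workaround mentioned above could look roughly like the following. This is a minimal sketch, not part of LangChain or the Kay.ai package: `filter_by_metadata` is a hypothetical helper, and the small `Document` dataclass is only a stand-in for langchain's `Document` so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for langchain's Document (assumption: only these two fields matter here)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[Document], **expected: Any) -> List[Document]:
    """Hypothetical helper: keep documents whose metadata matches every key/value given."""
    return [
        doc
        for doc in docs
        if all(doc.metadata.get(key) == value for key, value in expected.items())
    ]


# Illustrative documents shaped like the retriever output in this issue.
docs = [
    Document("...", {"company_name": "ETSY INC", "data_source": "10-K"}),
    Document("...", {"company_name": "PALANTIR TECHNOLOGIES INC", "data_source": "10-K"}),
]
palantir_only = filter_by_metadata(docs, company_name="PALANTIR TECHNOLOGIES INC")
```

Note that this only discards results after the fact; it cannot make the backend return more matching documents.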

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

plv commented 1 year ago

Hi @olgadk7 , we had an issue on our backend at Kay.ai. Could you give it another try? It should be fixed now -- I retried your query and got an expected result.

>>> from langchain.retrievers import KayAiRetriever

>>> retriever = KayAiRetriever.create(dataset_id="company", data_types=["10-K"], num_contexts=1)

>>> retriever.get_relevant_documents("Tell me about the returns of Palantir Technologies Inc.?")

[Document(page_content='Company Name: PALANTIR TECHNOLOGIES INC \n Company Industry: SERVICES-PREPACKAGED SOFTWARE \n Form Title: 10-K 2022-FY \n Form Section: Risk Factors \n Text: Add: employer payroll taxes related to stock based compensation 17,156 106,283 Adjusted income from operations $ 420,753 $ 473,452 68 Components of Results of Operations Revenue We generate revenue from the sale of subscriptions to access our software in our hosted environment along with ongoing O&M services ("Palantir Cloud"), software subscriptions in our customers\' environments with ongoing O&M services ("On Premises Software"), and professional services.Palantir Cloud Our Palantir Cloud subscriptions grant customers the right to access the software functionality in a hosted environment controlled by Palantir and are sold together with stand ready O&M services, as further described below.We promise to provide continuous access to the hosted software throughout the contract term.Revenue associated with Palantir Cloud subscriptions is generally recognized over the contract term on a ratable basis, which is consistent with the transfer of control of the Palantir services to the customer.On Premises Software Sales of our software subscriptions grant customers the right to use functional intellectual property, either on their internal hardware infrastructure or on their own cloud instance, over the contractual term and are also sold together with stand ready O&M services.', metadata={'chunk_type': 'text', 'chunk_years_mentioned': [], 'company_name': 'PALANTIR TECHNOLOGIES INC', 'company_sic_code_description': 'SERVICES-PREPACKAGED SOFTWARE', 'data_source': '10-K', 'data_source_link': 'https://www.sec.gov/Archives/edgar/data/1321655/000132165523000011', 'data_source_publish_date': '2022-01-01T00:00:00Z', 'data_source_uid': '0001321655-23-000011', 'title': 'PALANTIR TECHNOLOGIES INC |  10-K 2022-FY '})]
olgadk7 commented 1 year ago

Hi @plv, thank you.

Yes, it works for Palantir now. Unfortunately, when I do the same for SHOPIFY INC with num_contexts=7, 6 of the docs have 'company_name': 'ETSY INC' and 1 has 'COTY INC'.

And filtering the returned documents by metadata after they have been retrieved, as suggested by @dosu-bot, is not really an economical solution.
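To make the cost concrete, here is a small sketch using only the counts reported above (6 ETSY INC, 1 COTY INC for num_contexts=7). The list is typed in by hand for illustration; it is not live retriever output.

```python
from collections import Counter

# company_name values as described above for the SHOPIFY INC query (num_contexts=7).
returned_names = ["ETSY INC"] * 6 + ["COTY INC"]

# Tally how many contexts came back per company.
counts = Counter(returned_names)

# Post-retrieval filtering keeps only the contexts for the requested company.
matching = [name for name in returned_names if name == "SHOPIFY INC"]

print(counts)
print(len(matching))
```

All seven retrieved contexts are paid for, and none survive the filter.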

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for reaching out with your issue. I understand that you're having trouble with the KayAiRetriever class and would like to add metadata filtering to it. Based on your description, I suggest modifying the _get_relevant_documents method in the KayAiRetriever class to include a company name parameter. This parameter can then be used to filter the documents returned by the self.client.query call.

Here's a sample modification you can make:

def _get_relevant_documents(
    self, query: str, company_name: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
    ctxs = self.client.query(query=query, num_context=self.num_contexts)
    docs = []
    for ctx in ctxs:
        if ctx.get('company_name') != company_name:
            continue
        page_content = ctx.pop("chunk_embed_text", None)
        if page_content is None:
            continue
        docs.append(Document(page_content=page_content, metadata={**ctx}))
    return docs

In this code, ctx.get('company_name') is used to get the company name from the metadata of each document. If the company name does not match the specified company name, the document is skipped. Please replace 'company_name' with the actual key used in your metadata to store the company name.

Please note that this modification assumes that the metadata of each document includes the company name. If this is not the case, you will need to modify the code accordingly.

As for the structure of the metadata returned by the self.client.query call, it's a dictionary with keys and values that were in the original context, excluding the "chunk_embed_text" key. The exact keys and values would depend on the specific data returned by the self.client.query call.
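As a rough illustration of that split, the sketch below separates one raw context dict into page content and metadata. `split_context` is a hypothetical helper and the sample dict is made up to resemble the output shown in this thread; unlike the `ctx.pop` call in the retriever, it works on a copy so the input dict is left untouched.

```python
from typing import Any, Dict, Optional, Tuple


def split_context(ctx: Dict[str, Any]) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: return (page_content, metadata), or None if there is no text."""
    metadata = dict(ctx)  # copy, so the caller's dict is not mutated
    page_content = metadata.pop("chunk_embed_text", None)
    if page_content is None:
        return None
    return page_content, metadata


# Illustrative context; real keys depend on what self.client.query returns.
ctx = {
    "chunk_embed_text": "Company Name: PALANTIR TECHNOLOGIES INC ...",
    "company_name": "PALANTIR TECHNOLOGIES INC",
    "data_source": "10-K",
}
result = split_context(ctx)
```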

I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.

Best, Dosu


plv commented 1 year ago

@olgadk7 I believe we are behaving correctly -- $SHOP is a foreign company and does not file 10-K or 10-Q filings. See the EDGAR results here: https://www.sec.gov/edgar/search/#/ciks=0001594805&entityName=SHOPIFY%2520INC.%2520(SHOP)%2520(CIK%25200001594805)&page=2

We're working on adding data for other filing types. If you want, I can let you know when that's available. Given that Shopify is a Canadian company, the filing type we will need is the 40-F form.

olgadk7 commented 1 year ago

Good catch, but why does it return ETSY instead of an error?

And what are your plans / timeline for metadata filtering (e.g. by 'company_name') and ordering (e.g. by 'data_source_publish_date')?

ty!
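For reference, ordering by 'data_source_publish_date' on the client side might look like the sketch below. `sort_by_publish_date` is a hypothetical helper, and it assumes every metadata dict carries a timestamp in the same format as the values shown in this thread (e.g. '2022-01-01T00:00:00Z').

```python
from datetime import datetime
from typing import Any, Dict, List

# Assumed timestamp format, matching values like '2022-01-01T00:00:00Z'.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def sort_by_publish_date(
    metadatas: List[Dict[str, Any]], newest_first: bool = True
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return a new list ordered by 'data_source_publish_date'."""
    return sorted(
        metadatas,
        key=lambda m: datetime.strptime(m["data_source_publish_date"], DATE_FORMAT),
        reverse=newest_first,
    )


# Illustrative metadata rows, not real retriever output.
rows = [
    {"title": "10-K 2020-FY", "data_source_publish_date": "2020-01-01T00:00:00Z"},
    {"title": "10-K 2022-FY", "data_source_publish_date": "2022-01-01T00:00:00Z"},
]
ordered = sort_by_publish_date(rows)
```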

plv commented 12 months ago

@olgadk7 Sorry for the late reply, was out on Friday and missed your post :)

> Good catch, but why does it return ETSY instead of an error?

It's similar to how typing random nonsense into Google still somehow gets you results. We attempt a "best effort" search: it is more of a "neural search" and less of a "rigid database".

> And what are your plans / timeline for metadata filtering (e.g. by 'company_name') and ordering (e.g. by 'data_source_publish_date')?

Right now we actually do smart filtering automatically. If you mention a company name that exists in our DB (i.e. the company has filed a 10-K or 10-Q), we filter on that company name behind the scenes.
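The thread does not describe how this automatic filtering is implemented, so the following is only a guess at the general idea, not Kay.ai's actual logic. `detect_company` is a hypothetical function that does a naive case-insensitive substring match against a hand-written list of known company names.

```python
from typing import Iterable, Optional


def detect_company(query: str, known_companies: Iterable[str]) -> Optional[str]:
    """Hypothetical: return the first known company name mentioned in the query, if any."""
    normalized = query.upper()
    for name in known_companies:
        if name.upper() in normalized:
            return name
    return None


# Illustrative list: companies that appear with 10-K data in this thread.
known = ["PALANTIR TECHNOLOGIES INC", "ETSY INC", "COTY INC"]

palantir = detect_company("Tell me about the returns of Palantir Technologies Inc.?", known)
shopify = detect_company("Tell me about the returns of Shopify Inc.?", known)
```

Under this assumption, the Palantir query yields a name to filter on, while the Shopify query yields nothing, so the search would fall back to unfiltered best-effort results.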

Would you prefer explicit filtering support in the API?

P.S. If you have any more questions feel free to reach out to our support email -- hello@kay.ai :)

olgadk7 commented 12 months ago

I would prefer explicit filtering. Thank you, closing!