run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: LlamaIndex retriever on Confluence pages #16958

Open hz6yc3 opened 7 hours ago

hz6yc3 commented 7 hours ago

Question

I am using the LlamaIndex-provided Confluence Reader to ingest Confluence pages. I parse the pages with the SemanticSplitterNodeParser to generate nodes and index them in a vector store, which works fine. On top of the index I have created a context chat engine that uses the index as its retriever (I have tried the simple as_retriever(), RecursiveRetriever(), etc.).
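
For reference, the ingestion roughly looks like this (the embedding model, base URL, and space key are placeholders, not my actual values):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.confluence import ConfluenceReader

# Load the Confluence pages (authentication details omitted)
reader = ConfluenceReader(base_url="https://<org>.atlassian.net/wiki")
documents = reader.load_data(space_key="<SPACE>", include_attachments=False)

# Semantic chunking into nodes, then a vector index over those nodes
embed_model = OpenAIEmbedding()
parser = SemanticSplitterNodeParser.from_defaults(embed_model=embed_model)
nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes, embed_model=embed_model)
chat_engine = index.as_chat_engine(chat_mode="context")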

The Confluence pages are structured so that a unique ticket number appears in the page title (page heading), while the page content describes the ticket: when it was created and other details about it. The ticket number is not mentioned anywhere in the page content; it appears only in the page title. The problem I am facing is that when a user asks a question that references the ticket number (for example, "Can you summarize the ticket XYZ?"), the retrievers consistently behave poorly and do not retrieve the related document chunks, even though the page title is present in the node/document metadata. I tried various approaches, such as using a DocumentSummaryIndex to summarize each page and building a retriever on top of it, but none of them work as expected in this case. I even tried copying the page title into the document text, but still no luck.
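
For what it's worth, this is roughly how I copied the title into the document text before parsing (assuming the Confluence reader stores the page title under metadata["title"]):

# Prepend the page title (which holds the ticket number) to each document's text
# so it gets embedded along with the content; the "title" key is an assumption
# about what the Confluence reader puts in document metadata.
for doc in documents:
    title = doc.metadata.get("title", "")
    doc.set_content(f"{title}\n\n{doc.get_content()}")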

The retrievers behave erratically and do not pull the correct context. Post-processing techniques such as re-rankers will not help, because the retrieval itself fails to surface the relevant chunks. Any suggestions on this issue?

@dosu

dosubot[bot] commented 7 hours ago

I found a similar issue that might be relevant to your situation. In a closed issue titled "[Question]: How does llama-index using metadata for query?", it was suggested to modify the _retrieve method of the DocumentSummaryIndexLLMRetriever class to improve recall by extending the keywords list with the 'summary' from the node metadata, i.e. including the metadata 'summary' in the summary_nodes. That could help in cases where important information, such as a ticket number, is present only in the metadata [1].

Additionally, you might want to consider a hybrid approach that combines BM25 with the vector index, as suggested in another issue about poor performance of vector retrievers on keyword-style queries. Lexical matching often handles exact strings such as ticket numbers better than embeddings alone, so this can yield better results for such queries [2]. A minimal sketch is shown below.
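
A rough sketch of that hybrid setup (it assumes the parsed nodes are still in scope and that the llama-index-retrievers-bm25 package is installed; the top-k values and fusion mode are only illustrative):

from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Keyword-based retriever over the same nodes that back the vector index
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

# Fuse lexical and vector results so exact strings like a ticket number can still match
hybrid_retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=10), bm25_retriever],
    similarity_top_k=10,
    num_queries=1,  # disable LLM query rewriting
    mode="reciprocal_rerank",
)

retrieved_nodes = hybrid_retriever.retrieve("Can you summarize the ticket XYZ?")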


hz6yc3 commented 2 hours ago

Thanks! I have dates embedded within the context or the page content. How can I use a filter condition on the date to pull only relevant chunks @dosu

dosubot[bot] commented 1 hour ago

You can use the MetadataFilters and MetadataFilter classes in LlamaIndex to apply filter conditions on date fields. Here's an example of how to filter document chunks based on a date range:

from llama_index.core.vector_stores.types import MetadataFilter, MetadataFilters

# Keep only nodes whose commit_date falls between 2023-08-15 and 2023-08-25;
# ISO-formatted (YYYY-MM-DD) date strings compare correctly as plain strings.
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="commit_date", value="2023-08-15", operator=">="),
        MetadataFilter(key="commit_date", value="2023-08-25", operator="<="),
    ],
    condition="and",
)

# Pass the filters to the retriever so they are applied at query time
retriever = index.as_retriever(
    similarity_top_k=10,
    filters=filters,
)

retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)

In this example, the MetadataFilters object is used to filter nodes based on the commit_date field, retrieving only those nodes with commit dates between "2023-08-15" and "2023-08-25" [1].


hz6yc3 commented 1 hour ago

I am referring to dates embedded within the text or content of the page. I want to filter the chunks by matching those dates, or apply conditional filtering based on dates embedded within the text. @dosu

hz6yc3 commented 1 hour ago

@dosu

dosubot[bot] commented 1 hour ago

To filter chunks based on dates embedded within the text content, you can use an LLM to extract date entities from each chunk, store them as metadata, and then apply metadata filters. Here's how you can achieve this:

  1. Extract Date Entities: Use a structured-output call to the LLM (for example, structured_predict with a small Pydantic schema) to identify date entities within your text content and normalize them to a sortable format such as YYYY-MM-DD.

  2. Apply Filters on Extracted Dates: Once the dates are extracted and stored as node metadata, use MetadataFilters to filter the document chunks based on those dates.

Here's a conceptual example (a sketch that assumes an OpenAI LLM and dates normalized to YYYY-MM-DD strings; the model name is illustrative):

from typing import List

from pydantic import BaseModel, Field
from llama_index.core.prompts import PromptTemplate
from llama_index.core.vector_stores import (
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)
from llama_index.llms.openai import OpenAI

# Schema for the structured output we want the LLM to return
class ExtractedDates(BaseModel):
    dates: List[str] = Field(
        default_factory=list,
        description="Dates mentioned in the text, normalized to YYYY-MM-DD",
    )

llm = OpenAI(model="gpt-4o-mini")
prompt = PromptTemplate("List every date mentioned in the following text:\n\n{text}")

# Extract dates per node and store them as metadata before (re)indexing
for node in nodes:
    result = llm.structured_predict(ExtractedDates, prompt, text=node.get_content())
    if result.dates:
        node.metadata["extracted_date"] = min(result.dates)  # earliest date in the chunk

# Apply filters on the extracted dates at query time
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="extracted_date", value="2023-08-15", operator=FilterOperator.GTE),
        MetadataFilter(key="extracted_date", value="2023-08-25", operator=FilterOperator.LTE),
    ],
    condition=FilterCondition.AND,
)

retriever = index.as_retriever(filters=filters)
retrieved_nodes = retriever.retrieve("What is this software project about?")

for node in retrieved_nodes:
    print(node.node.metadata)

This approach allows you to dynamically filter content based on dates extracted from the text, relying on the LLM's structured-output support rather than on the raw chunk text alone [1].


hz6yc3 commented 31 minutes ago

@dosu This is not working. I am not able to find the related libraries for extracting structured data from text

dosubot[bot] commented 29 minutes ago

To extract structured data from text using LlamaIndex, you need the following libraries:

  1. llama-index: The main library for LlamaIndex.
  2. unstructured[msg]: For handling unstructured data, required for .eml and .msg formats.
  3. llama-index-program-openai: For integrating with OpenAI's API.
  4. llama-index-llms-llama-api: For using the Llama API.
  5. llama-index-readers-structured-data: For reading structured data from JSON, JSONL, CSV, and XLSX formats.

You can install these libraries using the following commands:

!pip install llama-index
!pip install "unstructured[msg]"
!pip install llama-index-program-openai
!pip install llama-index-llms-llama-api
!pip install llama-index-readers-structured-data

Additionally, you will need an API key from Llama API if you want to use the Llama API for structured data extraction [1][2][3]. A minimal setup sketch is shown below.
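
A minimal sketch of that setup (the LLAMA_API_KEY environment variable name is only an illustration; store the key however you normally handle secrets):

import os

from llama_index.llms.llama_api import LlamaAPI

# Build an LLM client backed by Llama API, reading the key from the environment
llm = LlamaAPI(api_key=os.environ["LLAMA_API_KEY"])

In principle, that llm can then replace the OpenAI LLM in the structured-extraction sketch above.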
