Open elia-bracci-hs opened 4 hours ago
Hey @elia-bracci-hs! 👋 I'm here to help you with your Confluence reader issue. I'm Dosu, a bot that can assist with bugs, questions, and becoming a contributor.
I couldn't find a specific solution for filtering out sections like "Recent updates" or "Blog stream" in the LlamaIndex Confluence reader. However, you can customize the content extraction process by modifying the process_page method or implementing additional filtering logic after retrieving the content. The process_page method converts HTML content to text, and you can add custom logic to exclude unwanted sections based on specific keywords or patterns.
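As a rough illustration of the "additional filtering logic after retrieving the content" idea, here is a minimal sketch of a post-processing step. It is not part of the Confluence reader. It assumes the extracted text marks sections with Markdown-style headings (for example `## Recent updates`), which may not match your pages, and the names `UNWANTED_TITLES`, `HEADING_RE`, and `strip_sections` are hypothetical helpers made up for this example.

```python
import re

# Hypothetical section titles to drop; adjust to what appears in your pages.
UNWANTED_TITLES = {"recent updates", "blog stream"}

# Assumption: headings show up as Markdown-style lines such as "## Recent updates".
HEADING_RE = re.compile(r"^\s*#{1,6}\s+(.*?)\s*#*\s*$")


def strip_sections(text, unwanted=UNWANTED_TITLES):
    """Return text with every unwanted section (heading plus body) removed."""
    kept = []
    skipping = False
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Any heading ends the previous section; decide whether to skip
            # the new one. Simplification: heading levels are ignored, so a
            # sub-heading inside an unwanted section ends the skipping.
            skipping = match.group(1).strip().lower() in unwanted
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "# Welcome",
    "Intro text.",
    "## Recent updates",
    "Jane edited a page 2 hours ago",
    "## Blog stream",
    "Some blog post",
    "## Final notes",
    "Goodbye.",
])
cleaned = strip_sections(sample)
print(cleaned)
```

In practice you would run something like this over the text of each document returned by the reader before indexing. If the unwanted blocks are not introduced by headings in your output, the matching rule would need to change, for example to keyword or pattern checks on individual lines.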
If you need to exclude certain metadata fields, you can control which ones are visible by setting the following attributes on each document:
Exclude metadata from LLM visibility:
document.excluded_llm_metadata_keys = ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"]
Exclude metadata from embedding visibility:
document.excluded_embed_metadata_keys = ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"]
These settings allow you to control which metadata fields are visible to the LLM or embedding model, enabling you to tailor the information used during processing [1][2][3][4].
There are a LOT of options on this reader. Not 100% sure what they all do, but the source code might be helpful. Maybe there's an easy PR here https://github.com/run-llama/llama_index/blob/293709e725718e70f1976870dd2462fd83d7a727/llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py#L108
Question
Issue Description
When using the LlamaIndex Confluence reader to retrieve page content, the output includes unwanted sections such as "Recent updates," "Blog stream," and dynamically generated metadata like contributors and timestamps.
I want to filter out these sections and keep only the core content, such as the introductory text and final messages. This is how the Atlassian API behaves: it returns only the page content.
Example: llama-index-reader-confluence:
Atlassian Confluence API: