[Question]: Confluence Reader page content

elia-bracci-hs commented 4 hours ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

Issue Description

When using the LlamaIndex Confluence reader to retrieve page content, the output includes unwanted sections such as "Recent updates," "Blog stream," and dynamically generated metadata like contributors and timestamps.

I want to filter out these sections and keep only the core content, such as the introductory text and the closing message. This is how the Atlassian API behaves: it returns only the page content.

Example output from llama-index-readers-confluence:

Say hello to your colleagues who want to know your name, pronouns, role, team\nand location (or if you\'re remote).\n\n## 📄 Recent pages that I\'ve worked on\n\n## Recent updates\n\n  * LLM BOT Test page 2\n\nNov 08, 2024 • contributed by Elia Bracci\n\n  * LLM BOT test page\n\nNov 08, 2024 • contributed by Elia Bracci\n\n  * Overview\n\nSept 02, 2024 • contributed by Elia Bracci\n\n  * Elia Bracci\n\nSept 02, 2024 • contributed by Elia Bracci\n\n## Blog stream\n\nCreate a blog post to share news and announcements with your team and company.\n\nCreate blog post\n\n## 🖐 Get in touch\n\n✉️\n\n💼\n\n🔗\n\n👤\n\nEnd with a bang! Some options are: "I am so grateful to be here at <Insert\ncompany name> and very excited to get started!" or "Looking forward to meeting\nall of you!" or "Can\'t wait to get to know all of you!"

Atlassian Confluence API:

Say hello to your colleagues who want to know your name, pronouns, role, team and location (or if you're remote). 📄 Recent pages that I've worked on 5 5 titles 🖐 Get in touch ✉️ Insert your email here 💼 Insert your LinkedIn URL here 🔗 Insert your Twitter handle here 👤 Insert your Medium profile here End with a bang! Some options are: "I am so grateful to be here at <Insert company name> and very excited to get started!" or "Looking forward to meeting all of you!" or "Can't wait to get to know

dosubot[bot] commented 4 hours ago

Hey @elia-bracci-hs! 👋 I'm here to help you with your Confluence reader issue. I'm Dosu, a bot that can assist with bugs, questions, and becoming a contributor.

I couldn't find a built-in option for filtering out sections like "Recent updates" or "Blog stream" in the LlamaIndex Confluence reader. However, you can customize the content extraction by modifying the process_page method, or by filtering the content after it has been retrieved. The process_page method converts the page's HTML content to text, so custom logic there (or in a post-processing step on the returned text) can drop unwanted sections by matching their headings or other patterns.
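As a rough illustration of the post-processing route, here is a minimal sketch that drops unwanted sections from the Markdown-style text shown in the question. It assumes the unwanted blocks always start with a "#"-style heading, as they do in the example output; the function name strip_sections and the heading list are hypothetical and not part of the reader.

```python
import re

# Headings to drop, taken from the example output in this issue (assumption:
# the unwanted blocks always begin with one of these headings).
UNWANTED_HEADINGS = {"recent updates", "blog stream"}

# Matches Markdown ATX headings such as "## Recent updates".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def strip_sections(markdown_text, unwanted=UNWANTED_HEADINGS):
    """Remove each unwanted heading and its body.

    A removed section ends at the next heading of the same or a higher level.
    """
    kept = []
    skip_level = None  # level of the heading currently being skipped, if any
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip().lower()
            # A heading at the same or a higher level closes the skipped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            # Start skipping if this heading is on the unwanted list.
            if skip_level is None and title in unwanted:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)
```

This would be applied to the text of each document returned by the reader. It only catches sections introduced by an exact heading match, so headings that carry emoji or other extra text would need their own entries or a looser match.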

If you need to exclude certain metadata fields, you can set the following attributes on each document:

Exclude metadata from LLM visibility:

document.excluded_llm_metadata_keys = ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"]

Exclude metadata from embedding visibility:

document.excluded_embed_metadata_keys = ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"]

These settings allow you to control which metadata fields are visible to the LLM or embedding model, enabling you to tailor the information used during processing ^[1]^[2]^[3]^[4].
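To make the effect of these settings concrete, here is a simplified model of the behaviour described above. DocStandIn is a hypothetical stand-in written for this illustration, not the LlamaIndex Document class, and the metadata keys "title" and "page_id" are assumed for the example. It shows that excluding keys hides metadata fields only and leaves the page text itself untouched.

```python
from dataclasses import dataclass, field


@dataclass
class DocStandIn:
    """Hypothetical stand-in for a document object; NOT the LlamaIndex class."""

    text: str
    metadata: dict = field(default_factory=dict)
    excluded_llm_metadata_keys: list = field(default_factory=list)

    def llm_view(self):
        """Return the visible metadata followed by the unchanged page text."""
        visible = {
            key: value
            for key, value in self.metadata.items()
            if key not in self.excluded_llm_metadata_keys
        }
        header = "\n".join(f"{key}: {value}" for key, value in visible.items())
        return f"{header}\n\n{self.text}" if header else self.text


doc = DocStandIn(
    text="Say hello to your colleagues.\n\n## Recent updates\n\n  * Overview",
    metadata={"title": "Elia Bracci", "page_id": "123"},  # assumed keys
)
doc.excluded_llm_metadata_keys = ["page_id"]
view = doc.llm_view()
```

In this model the "Recent updates" section is still present in view, so metadata exclusion alone would not remove the sections the question asks about.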


logan-markewich commented 4 hours ago

There are a LOT of options on this reader. Not 100% sure what they all do, but the source code might be helpful. Maybe there's an easy PR here https://github.com/run-llama/llama_index/blob/293709e725718e70f1976870dd2462fd83d7a727/llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py#L108
