run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ConfluenceReader doesn't support table-cells with multiple lines #13050

Closed glenacota closed 2 months ago

glenacota commented 5 months ago

Bug Description

The ConfluenceReader doesn't produce valid markdown when loading Confluence pages that contain tables whose cells span multiple lines.

For example, given a Confluence page with the following table:

| col1 | col2 |
| --- | --- |
| * a list<br>* of items<br>* in cells | single line |

the ConfluenceReader creates a document with the following markdown as its content:

col1| col2  
---|---  

  * a list 
  * of items
  * in cells

| single line

which does not render as expected:

col1 col2

| single line


Note that the LangChain ConfluenceLoader handles multiline table cells correctly. Below is the document content it generates for the same table used in the example above:

| col1 | col2 |
| --- | --- |
| * a list * of items * in cells | single line |
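The difference between the two outputs comes down to keeping every table row on a single physical line. The following is a minimal sketch of that idea, not the code of either library; `flatten_cell` and `render_row` are hypothetical names, and cells are assumed to arrive as lists of lines.

```python
def flatten_cell(lines, separator=" "):
    """Collapse the lines of one cell into a single line of text."""
    parts = [line.strip() for line in lines if line.strip()]
    # Escape literal pipes so they are not read as column boundaries.
    return separator.join(parts).replace("|", "\\|")


def render_row(cells, separator=" "):
    """Render one table row; each cell is a list of lines (assumed input shape)."""
    return "| " + " | ".join(flatten_cell(c, separator) for c in cells) + " |"


row = render_row([["* a list", "* of items", "* in cells"], ["single line"]])
print(row)
```

Passing `separator="<br>"` instead keeps the line breaks visible in renderers that accept inline HTML inside table cells.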

Version

Version: 0.10.23

Steps to Reproduce

  1. Create/select a Confluence page with a table having a cell with multiple lines
  2. Use the ConfluenceReader to load data from that page
  3. Print the content of the generated document to verify whether the markdown of the table with the multiline cell is valid

The bug description above includes an example table, the expected valid markdown, and the invalid markdown currently produced.
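Step 3 can be partly automated. The sketch below is a heuristic check, not part of LlamaIndex: it looks for the delimiter row of a pipe table and verifies that the expected number of body rows follows it directly, each with the same column count. The function names and the `expected_rows` parameter are assumptions made for illustration; the two sample strings are the outputs quoted in this issue.

```python
import re

# A delimiter row such as "---|---" or "| --- | --- |".
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def split_cells(line):
    """Split a pipe-table line into cells, ignoring outer and escaped pipes."""
    s = line.strip()
    if s.startswith("|"):
        s = s[1:]
    if s.endswith("|") and not s.endswith("\\|"):
        s = s[:-1]
    return re.split(r"(?<!\\)\|", s)


def table_rows_intact(markdown, expected_rows):
    """Heuristic: True if the first table has expected_rows well-formed body rows."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if i > 0 and "|" in line and DELIMITER_ROW.match(line):
            width = len(split_cells(line))
            body = []
            for row in lines[i + 1:]:
                if not row.strip():  # a blank line ends the table
                    break
                body.append(row)
            return len(body) == expected_rows and all(
                len(split_cells(r)) == width for r in body
            )
    return False  # no table found


BROKEN = "col1| col2  \n---|---  \n\n  * a list \n  * of items\n  * in cells\n\n| single line\n"
FIXED = "| col1 | col2 |\n| --- | --- |\n| * a list * of items * in cells | single line |\n"

print(table_rows_intact(BROKEN, expected_rows=1))
print(table_rows_intact(FIXED, expected_rows=1))
```

In the broken output, the blank line right after the delimiter row ends the table before any body row, which is why the check fails for it.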

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 5 months ago

To address the issue with ConfluenceReader not handling multiline table cells correctly, you can modify the process_page method to better process HTML tables into markdown. This involves using BeautifulSoup to parse the HTML content, identify table elements, and convert them into a markdown format that preserves multiline cells.

Here's a concise approach to achieve this:

  1. Parse the HTML Content: Use BeautifulSoup to parse the page content and find all table elements.
  2. Convert Tables to Markdown: Implement a method, _html_table_to_markdown, that converts HTML tables to markdown. This method should handle multiline cells by replacing newline characters with <br> tags to maintain the multiline appearance in markdown.
  3. Replace HTML Tables with Markdown: In the process_page method, replace the original table HTML with the markdown version generated by your new method.

Here is a simplified example of how you might implement these changes:

from bs4 import BeautifulSoup

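def process_page(self, page, include_attachments, text_maker):
    html_content = page["body"]["export_view"]["value"]
    soup = BeautifulSoup(html_content, 'html.parser')

    # Replace each HTML table with its markdown equivalent
    for table in soup.find_all("table"):
        markdown_table = self._html_table_to_markdown(table)
        table.replace_with(markdown_table)

    # attachment_texts is assumed to be built earlier, as in the existing method (omitted in this simplified example)
    # Caveat: text_maker.handle() may reflow the inserted markdown, so verify the resulting table
    text = text_maker.handle(str(soup)) + "".join(attachment_texts)
    return Document(text=text, doc_id=page["id"], extra_info={...})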

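def _html_table_to_markdown(self, table):
    def cell_text(cell):
        # Join the lines of a cell with <br> and escape pipes so each row stays on one line
        text = cell.get_text(separator="\n", strip=True)
        return text.replace('|', '\\|').replace('\n', '<br>')

    rows = table.find_all('tr')
    if not rows:
        return ''

    # Use the first row as the header and emit it only once
    headers = [cell_text(cell) for cell in rows[0].find_all(['td', 'th'])]
    markdown = '| ' + ' | '.join(headers) + ' |\n'
    markdown += '| ' + ' | '.join(['---'] * len(headers)) + ' |\n'

    for row in rows[1:]:
        cells = row.find_all(['td', 'th'])
        markdown += '| ' + ' | '.join(cell_text(cell) for cell in cells) + ' |\n'

    return markdown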

This code snippet introduces a method to convert HTML tables to markdown, ensuring that multiline cells are correctly formatted. Adjust the process_page method to use this new functionality, which should resolve the issue with invalid markdown for tables with multiline cells.
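To try the idea outside the reader class, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup. It is an illustration under stated assumptions rather than the ConfluenceReader implementation: the class and function names are hypothetical, the first row is assumed to be the header, and nested tables, `colspan`, and `rowspan` are not handled.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the rows of a single, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of cell strings
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []
        elif self._cell is not None:
            if tag == "li":
                self._cell.append("\n* ")  # each list item starts a new line
            elif tag in ("br", "p"):
                self._cell.append("\n")

    def handle_data(self, data):
        if self._cell is not None:
            # Whitespace in the HTML source is not a line break.
            self._cell.append(re.sub(r"\s+", " ", data))

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            lines = [ln.strip() for ln in "".join(self._cell).split("\n")]
            text = "<br>".join(ln for ln in lines if ln)
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self.rows[-1].append(text.replace("|", "\\|"))
            self._cell = None


def html_table_to_markdown(html):
    """Convert one HTML table to a pipe table, first row treated as the header."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * len(header)) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sample = (
    "<table><tr><th>col1</th><th>col2</th></tr>"
    "<tr><td><ul><li>a list</li><li>of items</li><li>in cells</li></ul></td>"
    "<td>single line</td></tr></table>"
)
result = html_table_to_markdown(sample)
print(result)
```

Joining cell lines with `<br>` keeps each row on one line while preserving the visual line breaks in renderers that allow inline HTML; joining with a space instead would reproduce the LangChain-style output quoted in the issue.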

Let me know if you need further assistance or have any questions regarding this solution.

#### Details

**Sources**
- [llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py)

**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

glenacota commented 5 months ago

Is this an invitation to create a PR myself? I'm confused about the goal of the previous message, @dosu

dosubot[bot] commented 5 months ago

The message was intended to provide a detailed solution for the issue with ConfluenceReader not handling multiline table cells correctly. It suggests a method to modify the process_page function using BeautifulSoup to parse HTML content and convert tables into markdown format that preserves multiline cells. This solution involves identifying table elements, converting them to markdown, and ensuring multiline cells are formatted correctly by replacing newline characters with <br> tags.

This approach is a suggestion for how you could address the issue on your end. It's not explicitly an invitation to create a pull request (PR), but rather guidance on how you might resolve the problem locally in your implementation of ConfluenceReader. If implementing this solution is within your capabilities and you have the necessary permissions, it could be a way to work around the issue until an official update or patch is released.

If you're considering making these changes, please ensure that any modifications you make align with your project's requirements and that you test the updated ConfluenceReader thoroughly to ensure it handles multiline table cells as expected. If you have further questions or need additional assistance, feel free to ask.

#### Details

**Sources**
- [llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-confluence/llama_index/readers/confluence/base.py)
- [llama-index-legacy/llama_index/legacy/node_parser/relational/unstructured_element.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/node_parser/relational/unstructured_element.py)
- [llama-index-core/llama_index/core/node_parser/relational/markdown_element.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/relational/markdown_element.py)
- [llama-index-legacy/llama_index/legacy/node_parser/relational/markdown_element.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/node_parser/relational/markdown_element.py)
- [llama-index-core/llama_index/core/node_parser/relational/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/relational/utils.py)
- [llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py)