Closed glenacota closed 2 months ago
To address the issue with ConfluenceReader not handling multiline table cells correctly, you can modify the process_page method to better process HTML tables into markdown. This involves using BeautifulSoup to parse the HTML content, identify table elements, and convert them into a markdown format that preserves multiline cells.
Here's a concise approach to achieve this:
1. Add a new method, _html_table_to_markdown, that converts HTML tables to markdown. This method should handle multiline cells by replacing newline characters with <br> tags to maintain the multiline appearance in markdown.
2. In the process_page method, replace the original table HTML with the markdown version generated by your new method.
Here is a simplified example of how you might implement these changes:
from bs4 import BeautifulSoup

def process_page(self, page, include_attachments, text_maker):
    # Attachment handling from the original method is elided here;
    # it is what builds attachment_texts.
    html_content = page["body"]["export_view"]["value"]
    soup = BeautifulSoup(html_content, 'html.parser')
    # Swap each HTML table for its markdown equivalent before text conversion.
    for table in soup.find_all("table"):
        markdown_table = self._html_table_to_markdown(table)
        table.replace_with(markdown_table)
    text = text_maker.handle(str(soup)) + "".join(attachment_texts)
    return Document(text=text, doc_id=page["id"], extra_info={...})

def _html_table_to_markdown(self, table):
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = table.find_all('tr')
    markdown = '| ' + ' | '.join(headers) + ' |\n' + '| ' + ' | '.join(['---'] * len(headers)) + ' |\n'
    for row in rows:
        if not row.find_all('td'):
            continue  # header-only row, already emitted above
        cells = row.find_all(['td', 'th'])
        # Join the lines of a multiline cell with <br> so the row stays on one line.
        cell_texts = [cell.get_text(separator="\n", strip=True).replace('\n', '<br>') for cell in cells]
        markdown += '| ' + ' | '.join(cell_texts) + ' |\n'
    return markdown
This code snippet introduces a method to convert HTML tables to markdown, ensuring that multiline cells are correctly formatted. Adjust the process_page method to use this new functionality, which should resolve the issue with invalid markdown for tables with multiline cells.
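The conversion described above can also be sketched as a standalone, runnable function. This is a minimal sketch under stated assumptions, not the ConfluenceReader implementation: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies, the names TableToMarkdown and html_table_to_markdown are hypothetical, and it treats the first table row as the header. It also escapes pipe characters in cell text, an extra step the suggestion above does not mention.

```python
import re
from html.parser import HTMLParser


class TableToMarkdown(HTMLParser):
    """Collect table rows; lines inside one cell are joined with <br>."""

    # Tags assumed to start a new visual line inside a cell.
    BREAKING_TAGS = {"br", "p", "li", "div"}

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._lines = None  # lines of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._lines = [""]
        elif tag in self.BREAKING_TAGS and self._lines is not None:
            self._lines.append("")  # start a new line inside the cell

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._lines is not None and self._row is not None:
            lines = [line.strip() for line in self._lines if line.strip()]
            # Escape pipes so cell text cannot split the row into extra columns.
            self._row.append("<br>".join(lines).replace("|", "\\|"))
            self._lines = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._lines = None

    def handle_data(self, data):
        if self._lines is not None:
            # Collapse raw whitespace (including newlines) to single spaces.
            self._lines[-1] += re.sub(r"\s+", " ", data)


def html_table_to_markdown(html):
    """Convert the rows found in `html` to a markdown table string."""
    parser = TableToMarkdown()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join(["---"] * len(header)) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = ("<table><tr><th>Name</th><th>Notes</th></tr>"
              "<tr><td>single line</td><td><p>first</p><p>second</p></td></tr></table>")
    print(html_table_to_markdown(sample))
```

Because every cell is reduced to a single physical line before the row is assembled, each table row stays on one line of markdown regardless of how many paragraphs or list items the cell contained.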
Let me know if you need further assistance or have any questions regarding this solution.
To continue the conversation, mention @dosu.
Is this an invitation to create a PR myself? I'm confused about the goal of the previous message, @dosu.
The message was intended to provide a detailed solution for the issue with ConfluenceReader not handling multiline table cells correctly. It suggests a method to modify the process_page function using BeautifulSoup to parse HTML content and convert tables into markdown format that preserves multiline cells. This solution involves identifying table elements, converting them to markdown, and ensuring multiline cells are formatted correctly by replacing newline characters with <br> tags.
This approach is a suggestion for how you could address the issue on your end. It's not explicitly an invitation to create a pull request (PR), but rather guidance on how you might resolve the problem locally in your implementation of ConfluenceReader. If implementing this solution is within your capabilities and you have the necessary permissions, it could be a way to work around the issue until an official update or patch is released.
If you're considering making these changes, please ensure that any modifications you make align with your project's requirements and that you test the updated ConfluenceReader thoroughly to confirm it handles multiline table cells as expected. If you have further questions or need additional assistance, feel free to ask.
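One way to test the output is a structural check on the generated markdown. The sketch below is a deliberately strict check, assuming the table layout used in this thread (a leading and trailing pipe on every line, and a separator row directly under the header). Markdown renderers are often more lenient than this, and the helper names split_row and is_well_formed_table are hypothetical.

```python
import re

# A pipe that is not preceded by a backslash separates two cells.
_UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")
# A separator cell such as ---, :---, ---: or :---:
_SEPARATOR_CELL = re.compile(r":?-+:?")


def split_row(line):
    """Return the cells of one table line, or None if it is not a table row."""
    line = line.strip()
    if len(line) < 2 or not line.startswith("|"):
        return None
    if not line.endswith("|") or line.endswith("\\|"):
        return None
    return [cell.strip() for cell in _UNESCAPED_PIPE.split(line[1:-1])]


def is_well_formed_table(markdown):
    """True if every line is a row and all rows have the header's cell count."""
    rows = [split_row(line) for line in markdown.strip().splitlines()]
    if len(rows) < 2 or any(row is None for row in rows):
        return False
    header, separator = rows[0], rows[1]
    if not all(_SEPARATOR_CELL.fullmatch(cell) for cell in separator):
        return False
    return all(len(row) == len(header) for row in rows)


if __name__ == "__main__":
    fixed = "| Name | Notes |\n| --- | --- |\n| single line | first<br>second |\n"
    broken = "| Name | Notes |\n| --- | --- |\n| single line | first\nsecond |\n"
    print(is_well_formed_table(fixed), is_well_formed_table(broken))
```

A raw newline inside a cell splits the row across two physical lines, neither of which is a complete row, so the check rejects it; the <br> variant keeps the row on one line and passes.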
Bug Description
The ConfluenceReader doesn't produce valid markdown when loading Confluence pages with tables having cells with multiple lines.
For example, given a Confluence page with the following table
of items
* in cells
the ConfluenceReader will create a document having the following markdown as content
which is then not rendered as expected
| single line
Note that the LangChain ConfluenceLoader handles multiline table cells correctly. Below is the document content it generates for the same table used in the example above
Version
Version: 0.10.23
Steps to Reproduce
An example of a table + expected valid markdown + currently obtained invalid markdown is provided in the description of this bug issue.
Relevant Logs/Tracebacks
No response