langchain-ai / langchain


Confluence Loader does not add any distinction between multiple lines of text that are inside a cell in a table when using 'keep_markdown_format = True' #26089

Open MonoMarkor opened 1 week ago

MonoMarkor commented 1 week ago

Checked other resources

Example Code

Hello. When I load a Confluence page with the Confluence loader and 'keep_markdown_format = True', I have noticed weird formatting inside a table cell that contains text on multiple lines.

That is, when there are multiple p tags inside a table cell, nothing separates the text inside those tags: no space and no newline.

When I tried keep_markdown_format = False, the text inside the table was separated by a space, which is good.

I want to keep 'keep_markdown_format = True'. Is there a way to solve this?

Error Message and Stack Trace (if applicable)

No response

Description

This is a piece of text

This is the second line inside the cell of a table

When using keep_markdown_format = True I get: This is a piece of textThis is the second line inside the cell of a table

When using keep_markdown_format = False I get: This is a piece of text This is the second line inside the cell of a table

As you can see, when I set it to True there is nothing separating the multiple lines inside a table cell.

System Info

langchain==0.2.12 langchain-community==0.2.11 Windows 11 Python 3.11.0

MonoMarkor commented 1 week ago

I just found the exact same issue, which was closed, but I can't get the solution to work. It is here: https://github.com/langchain-ai/langchain/issues/11853

MonoMarkor commented 1 week ago

Currently I'm using this, and it works, but it's not the best:

    if keep_markdown_format:
        # Use markdownify to keep the page Markdown style
        content = content.replace("</p>", "\n</p>").replace("<br />", "\n")
        text = markdownify(content, heading_style="ATX") + "".join(attachment_texts)
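
The two chained replace calls above only match the exact strings `</p>` and `<br />`, so variants such as `<br>`, `<br/>`, or `</P>` would slip through. Below is a minimal sketch of the same preprocessing idea using regular expressions. The helper name `separate_paragraphs` and its `sep` parameter are hypothetical and not part of langchain or markdownify. It only rewrites the HTML string; how markdownify then renders the inserted separator inside a table cell may depend on the installed version, so treat this as a starting point rather than a confirmed fix.

```python
import re

# Hypothetical helper for illustration; not part of langchain or markdownify.
_P_CLOSE = re.compile(r"</p\s*>", flags=re.IGNORECASE)
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def separate_paragraphs(html: str, sep: str = "\n") -> str:
    """Insert `sep` before each closing </p> and replace <br> variants with `sep`."""
    # Callables are used as replacements so `sep` is inserted literally,
    # without regex escape processing.
    html = _P_CLOSE.sub(lambda m: sep + m.group(0), html)
    return _BR_TAG.sub(lambda m: sep, html)


if __name__ == "__main__":
    cell = (
        "<td><p>This is a piece of text</p>"
        "<p>This is the second line inside the cell of a table</p></td>"
    )
    print(separate_paragraphs(cell))
```

In the snippet above, the result would be passed to markdownify in place of the chained replace calls, for example `content = separate_paragraphs(content)`. Passing `sep=" "` instead of the default newline mirrors the space-separated output seen with keep_markdown_format = False.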