langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

MarkdownHeaderTextSplitter flattens paragraph separators into single line breaks #22256

Closed relston closed 1 day ago

relston commented 3 months ago

Checked other resources

Example Code

Example Code to Reproduce

# Requires the langchain-text-splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Sample Markdown input
markdown_text = """
# My Heading

This is a paragraph with some detailed explanation.

This is another separate paragraph.
"""

# Initialize and apply the text splitter
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[('#', 'Header 1')])
result = splitter.split_text(markdown_text)
print(result)

Expected Behavior

The splitter should preserve paragraph breaks, since subsequent text manipulation tasks may rely on the structure that separate paragraphs convey:

[Document(page_content='This is a paragraph with some detailed explanation.\n\nThis is another separate paragraph.', metadata={'Header 1': 'My Heading'})]

Actual Behavior

Currently, MarkdownHeaderTextSplitter collapses paragraph breaks into single line breaks, so the distinction between paragraphs is lost:

[Document(page_content='This is a paragraph with some detailed explanation.\nThis is another separate paragraph.', metadata={'Header 1': 'My Heading'})]

This issue affects not only readability but also the downstream processing capabilities that require structured and clearly delineated text for effective analysis and feature extraction.
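To make the downstream impact concrete, here is a minimal sketch that uses only plain Python string operations and no LangChain calls. The two strings are the `page_content` values quoted above; the helper `split_paragraphs` is a hypothetical stand-in for any later step that relies on blank lines to find paragraphs.

```python
# The two page_content values reported above.
expected = (
    "This is a paragraph with some detailed explanation."
    "\n\nThis is another separate paragraph."
)
actual = (
    "This is a paragraph with some detailed explanation."
    "\nThis is another separate paragraph."
)


def split_paragraphs(text: str) -> list[str]:
    """Hypothetical downstream step: split on blank lines."""
    return [part for part in text.split("\n\n") if part.strip()]


print(len(split_paragraphs(expected)))  # 2
print(len(split_paragraphs(actual)))    # 1
```

With the flattened output, a paragraph-based step sees one block instead of two.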

Error Message and Stack Trace (if applicable)

No response

Description

The current implementation of MarkdownHeaderTextSplitter in LangChain splits text on \n and strips whitespace from each line when processing Markdown. Removing this whitespace and the paragraph separators (\n\n) directly affects later text splitting and processing strategies, because it destroys the paragraph structure that most text analyses and transformations depend on.
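The behavior described above can be approximated in a few lines of plain Python. This is a simplified model for illustration only, not LangChain's actual code; the function name `flatten_like_splitter` is hypothetical, and it assumes blank lines are dropped once each line has been stripped.

```python
def flatten_like_splitter(text: str) -> str:
    """Simplified model of the reported behavior, not LangChain's code."""
    lines = text.split("\n")                     # split on single newlines
    stripped = [line.strip() for line in lines]  # strip whitespace per line
    kept = [line for line in stripped if line]   # blank lines disappear here
    return "\n".join(kept)                       # rejoin with single newlines


body = "First paragraph.\n\n    indented line\n\nSecond paragraph."
print(repr(flatten_like_splitter(body)))
# 'First paragraph.\nindented line\nSecond paragraph.'
```

Both the paragraph separators and the leading indentation are gone after the round trip, which is the loss this issue describes.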

Other Examples

This whitespace-stripping behavior has also been reported as problematic for other users' use cases, as evidenced by issues #20823 and #19436.

System Info

N/A

relston commented 3 months ago

Since I was going to re-implement the Markdown splitter for my own purposes anyway, I thought I would share my implementation as an experimental PR: https://github.com/langchain-ai/langchain/pull/22257. It addresses the issues stated above and adds a few other handy features, like breaking out code blocks and tagging them with language metadata where applicable.
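For readers wondering what "breaking out code blocks and tagging with the language" could look like, here is a rough sketch. It is not the code from the PR: the regex, the function name `extract_code_blocks`, and the `language` metadata key are all assumptions made for illustration, and it handles only simple backtick fences.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid nesting fences in this example

# Hypothetical pattern: opening fence, optional language tag, body, closing fence.
CODE_BLOCK = re.compile(
    rf"^{FENCE}(?P<lang>\w*)[ \t]*\n(?P<body>.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_code_blocks(markdown: str) -> list[dict]:
    """Return each fenced block with its body kept verbatim."""
    blocks = []
    for match in CODE_BLOCK.finditer(markdown):
        lang = match.group("lang")
        # Only tag the block when the fence actually names a language.
        metadata = {"language": lang} if lang else {}
        blocks.append({"page_content": match.group("body"), "metadata": metadata})
    return blocks


sample = f"Intro text.\n\n{FENCE}python\nprint('hi')\n{FENCE}\n\nOutro text."
blocks = extract_code_blocks(sample)
print(blocks[0]["metadata"])  # {'language': 'python'}
```

The block body is returned without stripping, so indentation and blank lines inside code survive.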

If we consider the whitespace-stripping output a "bug", then this might be a drop-in replacement, since the signatures match and it fixes that aspect of the current implementation. I would definitely love to get y'all's feedback on this! I know it's a heavy-handed change, but I think it's a good direction to go in.

Cheers!