langchain-ai / langchain

ExperimentalMarkdownSyntaxTextSplitter mixes text between split_text calls #26440

Open SuryaThiru opened 1 month ago

SuryaThiru commented 1 month ago

Example Code

import os

from langchain_text_splitters.markdown import ExperimentalMarkdownSyntaxTextSplitter

# A single splitter instance is reused for every file below.
splitter = ExperimentalMarkdownSyntaxTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
    ],
    strip_headers=False,
    return_each_line=False,
)

for file in sorted(os.listdir("testdata")):
    print(file)
    with open(f"testdata/{file}", "r") as f:
        text = f.read()

    splits = splitter.split_text(text)

    for split in splits:
        print(split.metadata)
        print(split.page_content)
        print('-'*80)

    print('='*80)
    print()

Files

Files.zip

Error Message and Stack Trace (if applicable)

Output

sample1.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1

--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1

More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
================================================================================

sample2.md
{'Header 1': 'Header 1 from file 1'}
# Header 1 from file 1

Content 1 from file 1

--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 1', 'Header 2': 'Header 2 from file 1'}
## Header 2 from file 1

Content 2 file file 1

More stuff in file 1

* list1.1
    * list 2.1
    * list 2.2
* list1.2
--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2'}
# Header 1 from file 2

Content 1 from file 2

--------------------------------------------------------------------------------
{'Header 1': 'Header 1 from file 2', 'Header 2': 'Header 2 from file 2'}
## Header 2 from file 2

Content 2 file file 2

More stuff in file 2

1. list1.1
    1. list 2.1
    2. list 2.2
1. list1.2
--------------------------------------------------------------------------------
================================================================================

Description

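I was testing the ExperimentalMarkdownSyntaxTextSplitter class because of whitespace-handling issues in MarkdownHeaderTextSplitter. I noticed that the class carries text over between successive split_text calls on the same instance: in the output above, the splits for sample2.md begin with the chunks from sample1.md before any content from sample2.md appears.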

I do not believe this is intended. Please find the attached zip to reproduce the issue. Happy to help fix the issue.

Let me know if there are stable alternatives for splitting by markdown headers in the meantime.
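Until the class is fixed, one possible workaround is to build a new splitter for every document so that no instance is ever reused. The sketch below is only an illustration: `split_each_with_fresh_splitter` and `LeakySplitter` are hypothetical names, and `LeakySplitter` is a toy stand-in that merely imitates the reported behavior; it is not the langchain implementation.

```python
from typing import Callable, Iterable, List


def split_each_with_fresh_splitter(
    make_splitter: Callable[[], object], texts: Iterable[str]
) -> List[list]:
    """Split every text with its own splitter so no state is shared between calls."""
    return [make_splitter().split_text(text) for text in texts]


class LeakySplitter:
    """Hypothetical stand-in that imitates the bug: chunks accumulate across calls."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def split_text(self, text: str) -> List[str]:
        # State is appended to, never cleared, so earlier input leaks into later calls.
        self.chunks.extend(line for line in text.splitlines() if line.strip())
        return list(self.chunks)


docs = ["# A\nfoo", "# B\nbar"]

# Reusing one instance mixes the documents together.
shared = LeakySplitter()
mixed = [shared.split_text(doc) for doc in docs]

# A fresh instance per document keeps them separate.
separate = split_each_with_fresh_splitter(LeakySplitter, docs)
```

With the real class, the factory could be a lambda that calls ExperimentalMarkdownSyntaxTextSplitter with the same arguments as in the example code above. Whether this fully avoids the problem depends on where the state is stored, so it is worth checking against the attached files.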

System Info

python -m langchain_core.sys_info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:46 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6031
Python Version: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Package Information

langchain_core: 0.2.38
langchain: 0.2.16
langchain_community: 0.2.16
langsmith: 0.1.114
langchain_text_splitters: 0.2.4

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.5
async-timeout: Installed. No version info available.
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.8.2
PyYAML: 6.0.2
requests: 2.32.3
SQLAlchemy: 2.0.34
tenacity: 8.5.0
typing-extensions: 4.12.2

chkaty commented 2 days ago

@SuryaThiru Thank you for highlighting this intriguing issue. We are students from the University of Toronto and would be delighted to look into it further.

chkaty commented 21 hours ago

@SuryaThiru We’d like to propose modifying the split_text method to reset relevant attributes at the start of each invocation. This change will ensure that each call processes input independently without carrying over any previous state.

We would appreciate any feedback from the community on this approach. We are looking forward to your thoughts!
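The reset-on-entry idea can be sketched on a toy class. Everything here is an assumption made for illustration: `ToyHeaderSplitter`, `Chunk`, `_reset`, and the attribute names are hypothetical and do not mirror the attributes of ExperimentalMarkdownSyntaxTextSplitter. The point is only the pattern of gathering all per-call state in one method and calling it at the top of split_text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class ToyHeaderSplitter:
    """Hypothetical toy splitter that handles only '# ' headers."""

    def __init__(self) -> None:
        self._reset()

    def _reset(self) -> None:
        # All per-call state lives here, so one call cannot leak into the next.
        self.chunks: List[Chunk] = []
        self.current_header: Dict[str, str] = {}
        self.current_lines: List[str] = []

    def _flush(self) -> None:
        # Close the chunk being built, if any, and store it with its header metadata.
        if self.current_lines:
            content = "\n".join(self.current_lines)
            self.chunks.append(Chunk(content, dict(self.current_header)))
            self.current_lines = []

    def split_text(self, text: str) -> List[Chunk]:
        self._reset()  # the proposed fix: start every call from a clean state
        for line in text.splitlines():
            if line.startswith("# "):
                self._flush()
                self.current_header = {"Header 1": line[2:].strip()}
            if line.strip():
                self.current_lines.append(line)
        self._flush()
        return self.chunks


splitter = ToyHeaderSplitter()
first = splitter.split_text("# A\nfoo")
second = splitter.split_text("# B\nbar")
```

Because `_reset` rebinds `self.chunks` to a new list instead of clearing the old one, the list returned by an earlier call is not modified by a later call. A regression test for the real fix could follow the same shape: call split_text twice on one instance and assert that the second result contains nothing from the first input.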