Text splitters inheritance

GasparQ commented 10 months ago

System Info

I am facing an issue using the MarkdownHeaderTextSplitter class. After looking at the code, I noticed that the problem might be present in several text splitters. I did not find an existing issue about this, so I am creating a new one.

I am trying to use MarkdownHeaderTextSplitter through the TextSplitter interface by calling the transform_documents method.

However, MarkdownHeaderTextSplitter does not inherit from TextSplitter, and I wonder whether this is a deliberate design choice or just an oversight.

It seems that HTMLHeaderTextSplitter is in the same situation.

Can you give me some insight into how to use these classes if this behavior is intended?

Who can help?

No response

Information

[ ] The official example notebooks/scripts
[ ] My own modified scripts

Related Components

[ ] LLMs/Chat Models
[ ] Embedding Models
[ ] Prompts / Prompt Templates / Prompt Selectors
[ ] Output Parsers
[ ] Document Loaders
[X] Vector Stores / Retrievers
[ ] Memory
[ ] Agents / Agent Executors
[ ] Tools / Toolkits
[ ] Chains
[ ] Callbacks/Tracing
[ ] Async

Reproduction

This code reproduces what I am trying to do:

from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.document_loaders import TextLoader

loader = TextLoader("test.md")
document = loader.load()

transformer = MarkdownHeaderTextSplitter([
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
])

tr_documents = transformer.transform_documents(document)

Expected behavior

I want this to return a list of documents (langchain.docstore.document.Document) split in the same way that MarkdownHeaderTextSplitter.split_text splits the content of a Markdown document, as presented in the documentation.
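To make the expected behavior concrete, here is a minimal sketch of a wrapper that would provide it. It assumes only that the splitter exposes split_text and that documents carry a page_content attribute. The names Doc, HeaderSplitterAdapter, and ToyHeaderSplitter are hypothetical and exist only so the example runs without LangChain installed; this is not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class HeaderSplitterAdapter:
    """Hypothetical wrapper: adds transform_documents to a split_text-only splitter."""

    def __init__(self, splitter: Any) -> None:
        self.splitter = splitter

    def transform_documents(self, documents: Iterable[Any]) -> List[Any]:
        chunks: List[Any] = []
        for doc in documents:
            # Delegate to the wrapped splitter, one input document at a time.
            chunks.extend(self.splitter.split_text(doc.page_content))
        return chunks


class ToyHeaderSplitter:
    """Toy splitter (assumption, for demonstration): one chunk per '# ' heading."""

    def split_text(self, text: str) -> List[Doc]:
        chunks: List[Doc] = []
        for line in text.splitlines():
            if line.startswith("# "):
                chunks.append(Doc("", {"Header 1": line[2:]}))
            elif chunks:
                # Append body lines to the most recent heading's chunk.
                last = chunks[-1]
                last.page_content = (last.page_content + "\n" + line).strip()
        return chunks


adapter = HeaderSplitterAdapter(ToyHeaderSplitter())
chunks = adapter.transform_documents([Doc("# A\nalpha\n# B\nbeta")])
```

In real use, the wrapped object would be a MarkdownHeaderTextSplitter instance and the input would be the list returned by loader.load().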

dosubot[bot] commented 10 months ago

🤖

Hello @GasparQ,

I'm a bot here to assist you with your issue while we wait for a human maintainer to address it. My purpose is to help you navigate bugs, answer questions, and even guide you on how to contribute to this project. Rest assured, I'm currently reviewing your issue regarding the MarkdownHeaderTextSplitter and TextSplitter inheritance. I'll get back to you with a more detailed response shortly.

dosubot[bot] commented 10 months ago

🤖

Hello @GasparQ!

I'm here to assist you with your issue regarding LangChain. Please note that I'm an AI, but I'll do my best to help you while we wait for a human maintainer to join the conversation. Let's work together to get this sorted out!

Thank you for your detailed question, GasparQ. The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes in LangChain indeed do not inherit from the TextSplitter interface. This is an intentional design decision.

The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes are designed to split Markdown and HTML documents, respectively, into chunks based on the document's header structure. They do not implement the transform_documents method because their functionality is different from the other classes that inherit from TextSplitter.

To use the MarkdownHeaderTextSplitter, you would create an instance of the class, passing in a list of headers to split on. Then, you would call its split_text method with the Markdown document's text as an argument. Here is an example:

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "# Header 1\nContent 1\n## Header 2\nContent 2"
splitter = MarkdownHeaderTextSplitter([
    ("#", "Header 1"),
    ("##", "Header 2"),
])

documents = splitter.split_text(markdown_text)

The HTMLHeaderTextSplitter works similarly, but it splits HTML documents based on specified headers. You would create an instance of the class, passing in a list of headers to split on. Then, you would call its split_text_from_url method with the URL of the HTML document as an argument.

The TextSplitter class and its subclasses are used in LangChain to split text into chunks or tokens in a way that is appropriate for the type of text and the specific application. The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes serve a different purpose and therefore do not inherit from TextSplitter.

I hope this clarifies your question. If you have further questions, feel free to ask.

DSamuylov commented 9 months ago

I ran into a similar question, and I believe it would be helpful to align the logic here.

In addition to the example from @GasparQ, which shows how this would make it easier to apply MarkdownHeaderTextSplitter to a loaded document, it would also be helpful to inherit metadata.

Currently, after splitting text with:

markdown_text = "# Header 1\nContent 1\n## Header 2\nContent 2"
splitter = MarkdownHeaderTextSplitter([
    ("#", "Header 1"),
    ("##", "Header 2"),
])

documents = splitter.split_text(markdown_text)

the metadata contains only the Header 1 and Header 2 fields, and if I want to update the metadata, I need to do something like this:

for document in documents:
    document.metadata.update({"source": "my-source-value"})

However, if we pass a document as input, following the example from @GasparQ, the source field and any other metadata that may be set on the Document instance should be inherited by the output of MarkdownHeaderTextSplitter. This is, for example, the behaviour of RecursiveCharacterTextSplitter.split_documents.
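As a rough sketch of the metadata inheritance described above, the helper below copies each parent document's metadata onto the chunks produced from it. The function name split_with_inherited_metadata and the ToySplitter class are hypothetical; the objects built with SimpleNamespace are stand-ins for Document so the example runs without LangChain. One design choice here is an assumption: keys set by the splitter (such as Header 1) take precedence over parent keys with the same name.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def split_with_inherited_metadata(splitter: Any, documents: Iterable[Any]) -> List[Any]:
    """Hypothetical helper: split each document and pass its metadata to the chunks."""
    results: List[Any] = []
    for parent in documents:
        for chunk in splitter.split_text(parent.page_content):
            # setdefault keeps header keys set by the splitter and only
            # fills in parent keys (e.g. "source") that the chunk lacks.
            for key, value in parent.metadata.items():
                chunk.metadata.setdefault(key, value)
            results.append(chunk)
    return results


class ToySplitter:
    """Toy splitter (assumption, for demonstration): splits on '---'."""

    def split_text(self, text: str) -> List[Any]:
        return [
            SimpleNamespace(page_content=part.strip(), metadata={"Header 1": f"part {i}"})
            for i, part in enumerate(text.split("---"))
        ]


parent = SimpleNamespace(page_content="alpha --- beta", metadata={"source": "test.md"})
chunks = split_with_inherited_metadata(ToySplitter(), [parent])
```

With a real MarkdownHeaderTextSplitter in place of ToySplitter, this would remove the need for the manual metadata.update loop shown earlier.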
