Closed GasparQ closed 6 months ago
🤖
Hello @GasparQ,
I'm a bot here to assist you with your issue while we wait for a human maintainer to address it. My purpose is to help you navigate bugs, answer questions, and even guide you on how to contribute to this project. Rest assured, I'm currently reviewing your issue regarding the MarkdownHeaderTextSplitter and TextSplitter inheritance. I'll get back to you with a more detailed response shortly.
🤖
Hello @GasparQ!
I'm here to assist you with your issue regarding LangChain. Please note that I'm an AI, but I'll do my best to help you while we wait for a human maintainer to join the conversation. Let's work together to get this sorted out!
Thank you for your detailed question, GasparQ. The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes in LangChain indeed do not inherit from the TextSplitter base class. This is an intentional design decision.
The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes are designed to split Markdown and HTML documents, respectively, into chunks based on the document's header structure. They do not implement the transform_documents method because their functionality differs from that of the classes that inherit from TextSplitter.
To use the MarkdownHeaderTextSplitter, create an instance of the class, passing in a list of (header marker, metadata key) tuples to split on. Then call its split_text method with the Markdown document's text as an argument. Here is an example:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "# Header 1\nContent 1\n## Header 2\nContent 2"
splitter = MarkdownHeaderTextSplitter([
    ("#", "Header 1"),
    ("##", "Header 2"),
])
documents = splitter.split_text(markdown_text)
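To make the header-based behavior more concrete, here is a small toy splitter written in plain Python. It is not LangChain's implementation; `Chunk` and `toy_header_split` are hypothetical names introduced only for illustration. It sketches the general idea of grouping content under the headers in effect and recording those headers as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def toy_header_split(text, headers_to_split_on):
    """Group non-empty lines under the headers that are active when they appear."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}  # header level -> (metadata key, header title)
    chunks, buffer = [], []

    def flush():
        # Emit the buffered lines as one chunk tagged with the active headers.
        if buffer:
            meta = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append(Chunk("\n".join(buffer), meta))
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, key in markers:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes headers at the same or a deeper level.
                for lvl in [l for l in active if l >= level]:
                    del active[lvl]
                active[level] = (key, stripped[len(marker):].strip())
                break
        else:
            if stripped:
                buffer.append(stripped)
    flush()
    return chunks


toy_chunks = toy_header_split(
    "# Header 1\nContent 1\n## Header 2\nContent 2",
    [("#", "Header 1"), ("##", "Header 2")],
)
for chunk in toy_chunks:
    print(chunk.page_content, chunk.metadata)
```

In this sketch each chunk carries only header-derived metadata, which is the shape of result the split_text call above is described as producing.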
The HTMLHeaderTextSplitter works similarly, but it splits HTML documents based on specified headers. Create an instance of the class, passing in a list of headers to split on, then call its split_text_from_url method with the URL of the HTML document as an argument.
The TextSplitter class and its subclasses are used in LangChain to split text into chunks or tokens in a way that is appropriate for the type of text and the specific application. The MarkdownHeaderTextSplitter and HTMLHeaderTextSplitter classes serve a different purpose and therefore do not inherit from TextSplitter.
I hope this clarifies your question. If you have further questions, feel free to ask.
I came up with a similar question, and I believe it would be helpful to align logic here.
In addition to the example that @GasparQ mentioned, which shows how inheritance would make it easier to apply MarkdownHeaderTextSplitter to a loaded document, it would also be helpful to inherit metadata.
Currently, after splitting text with:
markdown_text = "# Header 1\nContent 1\n## Header 2\nContent 2"
splitter = MarkdownHeaderTextSplitter([
    ("#", "Header 1"),
    ("##", "Header 2"),
])
documents = splitter.split_text(markdown_text)
the metadata contains only the Header 1 and Header 2 fields, and if I want to update the metadata, I need to do something like this:
for document in documents:
    document.metadata.update({"source": "my-source-value"})
However, if we pass a document as input, following the example from @GasparQ, the source field and any other metadata that may be set on the Document instance should be inherited by the output of MarkdownHeaderTextSplitter. This is, for example, the behaviour of RecursiveCharacterTextSplitter.split_documents.
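Until something like this exists in the library, the behaviour I have in mind could be approximated with a small wrapper. The following is only a sketch: `split_documents_keeping_metadata` and `FakeHeaderSplitter` are hypothetical names, and the `Document` dataclass is a minimal stand-in so that the snippet runs without LangChain installed. The only assumption about the splitter is that its split_text method returns objects with page_content and metadata attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document so the sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents_keeping_metadata(splitter, documents):
    """Split each document and copy its metadata onto every resulting chunk."""
    result = []
    for parent in documents:
        for chunk in splitter.split_text(parent.page_content):
            # Parent metadata goes first, so header fields set by the
            # splitter win if both sides define the same key.
            merged = {**parent.metadata, **chunk.metadata}
            result.append(Document(page_content=chunk.page_content, metadata=merged))
    return result


class FakeHeaderSplitter:
    """Hypothetical splitter used only to exercise the wrapper."""

    def split_text(self, text):
        # One chunk per blank-line-separated block, tagged with a fixed header.
        return [
            Document(block, {"Header 1": "Intro"})
            for block in text.split("\n\n")
            if block.strip()
        ]


docs = [Document("first block\n\nsecond block", {"source": "my-source-value"})]
chunks = split_documents_keeping_metadata(FakeHeaderSplitter(), docs)
for chunk in chunks:
    print(chunk.page_content, chunk.metadata)
```

With the real library, the idea would be to pass a MarkdownHeaderTextSplitter instance and LangChain Document objects instead of the stand-ins, so that source and similar fields survive the split without a manual update loop.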
System Info
I am facing an issue using the MarkdownHeaderTextSplitter class. After looking at the code, I noticed that the problem might be present in several text splitters. I did not find an existing issue on this, so I am creating a new one.
I am trying to use MarkdownHeaderTextSplitter through the TextSplitter interface by calling the transform_documents method.
However, MarkdownHeaderTextSplitter does not inherit from TextSplitter, and I wondered whether this was a deliberate implementation choice or just an oversight.
It seems that HTMLHeaderTextSplitter is in the same situation.
Can you give me some insight into how to use these classes if this behavior is intended?
Who can help?
No response
Information
Related Components
Reproduction
This code is a good way to reproduce what I am trying to do
Expected behavior
I want this to return a list of documents (langchain.docstore.document.Document) split in the same way MarkdownHeaderTextSplitter.split_text splits the content of a Markdown document, as presented in the documentation.