In OpenSearch 2.13, we released the text chunking processor. This processor lets users chunk documents so that embedding models do not lose information through truncation. This RFC introduces the markdown algorithm proposed in the RFC on text chunking. We are opening this RFC to solicit feedback and determine whether users actually need this algorithm.
Introduction
This algorithm is dedicated to markdown files. Within a markdown document, the title hierarchy provides context for the passages under each subtitle. We can construct a tree based on the title levels in the document. For a given node in the tree, the chunk includes the titles and contents of every node on its path to the root title, that is, all of its ancestors. Users can configure the maximum depth of tree nodes; as the examples show, sections deeper than that depth stay in the chunk of their ancestor at the maximum depth. The following examples illustrate the algorithm.
Examples
Here is a simple example of a markdown file:
// input
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
For this example, the constructed tree has Root title at depth 1, Title 1 and Title 2 at depth 2, and Titles 1.1, 1.2, 2.1, and 2.2 at depth 3.
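To make the tree concrete, here is a minimal Python sketch of how it could be built from the title levels. It is an illustration only, not the processor's implementation: the `Node` class and the `build_tree` and `outline` helpers are hypothetical names, and the sketch recognizes only `#`-style headings (it would misread `#` lines inside fenced code blocks).

```python
import re
from dataclasses import dataclass, field

# Assumption: only ATX headings ("# Title") are recognized.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


@dataclass
class Node:
    level: int                                   # number of '#' characters; 0 for the synthetic root
    title: str                                   # heading text without the '#' prefix
    content: list = field(default_factory=list)  # lines directly under this heading
    children: list = field(default_factory=list)


def build_tree(markdown):
    """Nest headings by level; non-heading lines attach to the latest heading."""
    root = Node(0, "<document>")  # hypothetical synthetic root above all headings
    stack = [root]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            stack[-1].content.append(line)
            continue
        level = len(match.group(1))
        while stack[-1].level >= level:  # close siblings and deeper sections
            stack.pop()
        node = Node(level, match.group(2))
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Return one indented line per heading, in document order."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(outline(child, depth + 1))
    return lines


DOC = """# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2"""

print("\n".join(outline(build_tree(DOC))))
```

Running the sketch on the input above prints the seven titles indented by depth, with Title 1 and Title 2 under Root title and the four third-level titles under their respective parents.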
Example 1
// output when max_depth = 1
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]
Example 2
// output when max_depth = 2
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]
Example 3
// output when max_depth = 3
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
'''
,
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.2
Content 2.2
'''
]
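The three outputs above can be reproduced with a short Python sketch. It is one possible reading of the algorithm, not a reference implementation: `split_sections` and `chunk_markdown` are hypothetical names, the sketch works on a flat list of sections instead of an explicit tree, and it assumes that a section with no subsections that sits above `max_depth` also becomes a chunk of its own, a case the examples do not cover.

```python
import re

# Assumption: only ATX headings ("# Title") mark section boundaries.
HEADING = re.compile(r"^(#{1,6})\s+\S")


def split_sections(markdown):
    """Return [level, lines] pairs, one per heading, in document order."""
    sections = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append([len(match.group(1)), [line]])
        elif sections:  # text before the first heading is ignored in this sketch
            sections[-1][1].append(line)
    return sections


def chunk_markdown(markdown, max_depth=3):
    """Emit one chunk per section at max_depth, prefixed with its ancestors."""
    sections = split_sections(markdown)
    chunks = []
    path = []  # indices of the ancestors of the current section
    for i, (level, _) in enumerate(sections):
        while path and sections[path[-1]][0] >= level:
            path.pop()
        depth = len(path) + 1  # the root title has depth 1
        end = i + 1            # one past the last section of this subtree
        while end < len(sections) and sections[end][0] > level:
            end += 1
        is_leaf = end == i + 1
        # Assumption: leaves shallower than max_depth also produce a chunk.
        if depth == max_depth or (depth < max_depth and is_leaf):
            lines = []
            for j in path + list(range(i, end)):
                lines.extend(sections[j][1])
            chunks.append("\n".join(lines))
        path.append(i)
    return chunks


DOC = """# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2"""

for depth in (1, 2, 3):
    print(depth, len(chunk_markdown(DOC, max_depth=depth)))
```

With the example input, the sketch yields one, two, and four chunks for `max_depth` values of 1, 2, and 3, matching Examples 1 to 3.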
Pros and cons
Here are the pros and cons of this algorithm.
Pros
Both existing chunking algorithms, fixed token length and delimiter, are too simple to chunk documents in an organic and coherent manner. This algorithm lets users keep the context information for each section under a subtitle.
Cons
The algorithm does not apply to other hierarchical text formats such as HTML and Wikipedia.
Even for markdown documents, the algorithm only works well when most of the content sits under leaf nodes.
The algorithm may consume extra space because the root title and root content are repeated across chunks.
We may need to cascade multiple chunking algorithms if the root title and root content together exceed the truncation limit of the text embedding model. In that case, the downstream algorithm also loses the context information from the root title.
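The space cost mentioned above can be estimated with a small Python sketch. The `line_overhead` helper is a hypothetical name, and counting lines rather than tokens or bytes is a simplification chosen to keep the arithmetic easy to follow.

```python
def line_overhead(source, chunks):
    """Total chunk lines divided by source lines; 1.0 means nothing is repeated."""
    source_lines = len(source.splitlines())
    chunk_lines = sum(len(chunk.splitlines()) for chunk in chunks)
    return chunk_lines / source_lines


# Sections of the example document, two lines each (title and content).
root = ["# Root title", "Root content"]
t1 = ["## Title 1", "Content 1"]
t2 = ["## Title 2", "Content 2"]
t11 = ["### Title 1.1", "Content 1.1"]
t12 = ["### Title 1.2", "Content 1.2"]
t21 = ["### Title 2.1", "Content 2.1"]
t22 = ["### Title 2.2", "Content 2.2"]

source = "\n".join(root + t1 + t11 + t12 + t2 + t21 + t22)

# Chunks from Example 3: each leaf carries its ancestors with it.
chunks = [
    "\n".join(root + t1 + t11),
    "\n".join(root + t1 + t12),
    "\n".join(root + t2 + t21),
    "\n".join(root + t2 + t22),
]

print(round(line_overhead(source, chunks), 2))  # 24 chunk lines / 14 source lines -> 1.71
```

In the example, the root title and root content appear in all four chunks, so the chunks hold 24 lines against 14 in the source.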
Parameters
| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| max_depth | Optional | Int | The max depth for title in markdown formatted texts. Default is 3. |
| max_chunk_limit | Optional | Int | The chunk limit for chunking algorithms. Default is 100. Users can set this value to -1 to disable this parameter. |
The markdown algorithm has two parameters: max_chunk_limit behaves as it does in the other chunking algorithms, and max_depth sets the deepest title level that the algorithm considers.
API
Here is an example of creating an ingest pipeline with the markdown algorithm
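As a sketch of what such a request might look like, the Python snippet below builds a pipeline body in the shape used by the existing `text_chunking` processor (`algorithm` plus `field_map`). The `markdown` algorithm key, the pipeline name `markdown-chunking-pipeline`, the field names, and the `markdown_chunking_pipeline` helper are assumptions made for illustration, since the algorithm is only proposed in this RFC.

```python
import json


def markdown_chunking_pipeline(input_field, output_field, max_depth=3, max_chunk_limit=100):
    """Build an ingest pipeline body; the "markdown" algorithm key is assumed."""
    return {
        "description": "Chunk markdown documents by title hierarchy",
        "processors": [
            {
                "text_chunking": {
                    "algorithm": {
                        # Hypothetical algorithm name with the two proposed parameters.
                        "markdown": {
                            "max_depth": max_depth,
                            "max_chunk_limit": max_chunk_limit,
                        }
                    },
                    # Maps the source text field to the field that receives the chunks.
                    "field_map": {input_field: output_field},
                }
            }
        ],
    }


body = markdown_chunking_pipeline("passage_text", "passage_chunk", max_depth=2)

# Print the request as it could be sent from a REST console.
print("PUT _ingest/pipeline/markdown-chunking-pipeline")
print(json.dumps(body, indent=2))
```

The snippet only prints the request; sending it is left to whichever client is in use.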