chunkipy
is an extremely useful tool for segmenting long texts into smaller chunks,
based on either a character or token count. With customizable chunk sizes and splitting strategies, chunkipy
provides flexibility and control
for various text processing tasks.
chunkipy
was created to address the need within the field of natural language processing (NLP)
to chunk text so that it does not exceed the input size of neural networks such as BERT,
but it could be used for several other use cases.
The library offers some useful features:
chunkipy
offers the possibility of
providing a token estimator function, in order to build the chunks taking into account the tokenizer
that will use those chunks.chunkipy
,
in creating the chunks, avoids cutting sentences, and always tries to have a complete and syntactically correct sentence.
This is achieved through the use of the stanza
library, which utilizes semantic models to cut text
into meaningful sentences.chunkipy
offers the possibility to define an overlap_percentage
and create overlapping chunks to
preserve the context along chunks. chunkipy
offers complete flexibility
in choosing how to split, allowing users to define their own text splitting function or choose from a list
of pre-defined splitting strategies.By default, chunkipy
uses stanza
are main text splitting method; however, if stanza
produces
sentences with a number of tokens greater than the chunk size, other split strategy are used.
Here the list of predefined strategies, sorted by priority (the first one is executed first,
if the chunk of text is larger than the chunk size, it is further split using a lower priority
strategy).
Priority | Name | Effect |
---|---|---|
0 | split_by_sentences |
It uses stanza to split the text into meaningful sentences. |
1 | split_by_semicolon |
It splits the text using the semicolon and space ; as separator. |
2 | split_by_colon |
It splits the text using the colon and space : as separator. |
3 | split_by_comma |
It splits the text using the comma and space , as separator. |
4 | split_by_word |
It splits the text using the space as separator. |
You can install chunkipy
using pip:
pip install chunkipy
The main class in chunkipy
is TextChunker
, which takes a text and splits it into chunks of a given size.
You can use the default settings or specify custom parameters for the chunk size,
whether to split by characters or tokens, the tokenizer function to use (if tokens
is set to True`), and the list of split strategies to apply.
Here's an example of using it to chunk text that has a number of tokens greater than the input size of BERT:
from chunkipy import TextChunker, TokenEstimator
from transformers import AutoTokenizer
text = """Napoleon Bonaparte (born Napoleone Buonaparte; 15 August 1769 – 5 May 1821), later known by his regnal name Napoleon I, was a French military commander and political leader who rose to prominence during the French Revolution and led successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804, then Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy endures to this day, as a highly celebrated and controversial leader. He initiated many liberal reforms that have persisted in society, and is considered one of the greatest military commanders in history. His wars and campaigns are studied by militaries all over the world. Between three and six million civilians and soldiers perished in what became known as the Napoleonic Wars.
Napoleon was born on the island of Corsica, not long after its annexation by France, to a native family descending from minor Italian nobility. He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic.
Differences with the United Kingdom meant France faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm campaign, and at the Battle of Austerlitz, which led to the dissolution of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him. Napoleon defeated Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, and defeated the Russians in June 1807 at Friedland, forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram.
Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph the King of Spain in 1808. The Spanish and the Portuguese revolted in the Peninsular War aided by a British army, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France, resulting in a large coalition army defeating Napoleon at the Battle of Leipzig. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power."""
class BertTokenEstimator(TokenEstimator):
def __init__(self):
self.bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def estimate_tokens(self, text):
return len(self.bert_tokenizer.encode(text))
bert_token_estimator = BertTokenEstimator()
print(f"Num of chars: {len(text)}")
# --> Num of chars: 3149
print(f"Num of tokens (using BertTokenEstimator): {bert_token_estimator.estimate_tokens(text)}")
# --> Num of tokens (using BertTokenEstimator): 603
# This creates an instance of the TextChunker class with a chunk size of 512,
# using token-based segmentation and a custom tokenizer function bert_tokenizer.encode to count tokens.
text_chunker = TextChunker(512, tokens=True, token_estimator=BertTokenEstimator())
chunks = text_chunker.chunk(text)
for i, chunk in enumerate(chunks):
print(f"Chunk {i + 1}, num of tokens: {bert_token_estimator.estimate_tokens(chunk)} -> {chunk}")
This code returns:
Chunk 1, num of tokens 476: -> Napoleon Bonaparte (born Napoleone Buonaparte; 15 August 1769 – 5 May 1821), later known by his regnal name Napoleon I, was a French military commander and political leader who rose to prominence during the French Revolution and led successful campaigns during the Revolutionary Wars. He was the de facto leader of the French Republic as First Consul from 1799 to 1804, then Emperor of the French from 1804 until 1814 and again in 1815. Napoleon's political and cultural legacy endures to this day, as a highly celebrated and controversial leader. He initiated many liberal reforms that have persisted in society, and is considered one of the greatest military commanders in history. His wars and campaigns are studied by militaries all over the world. Between three and six million civilians and soldiers perished in what became known as the Napoleonic Wars. Napoleon was born on the island of Corsica, not long after its annexation by France, to a native family descending from minor Italian nobility. He supported the French Revolution in 1789 while serving in the French army, and tried to spread its ideals to his native Corsica. He rose rapidly in the Army after he saved the governing French Directory by firing on royalist insurgents. In 1796, he began a military campaign against the Austrians and their Italian allies, scoring decisive victories and becoming a national hero. Two years later, he led a military expedition to Egypt that served as a springboard to political power. He engineered a coup in November 1799 and became First Consul of the Republic. Differences with the United Kingdom meant France faced the War of the Third Coalition by 1805. Napoleon shattered this coalition with victories in the Ulm campaign, and at the Battle of Austerlitz, which led to the dissolution of the Holy Roman Empire. In 1806, the Fourth Coalition took up arms against him. Napoleon defeated Prussia at the battles of Jena and Auerstedt, marched the Grande Armée into Eastern Europe, and defeated the Russians in June 1807 at Friedland, forcing the defeated nations of the Fourth Coalition to accept the Treaties of Tilsit. Two years later, the Austrians challenged the French again during the War of the Fifth Coalition, but Napoleon solidified his grip over Europe after triumphing at the Battle of Wagram. Hoping to extend the Continental System, his embargo against Britain, Napoleon invaded the Iberian Peninsula and declared his brother Joseph the King of Spain in 1808.
Chunk 2, num of tokens 129: -> The Spanish and the Portuguese revolted in the Peninsular War aided by a British army, culminating in defeat for Napoleon's marshals. Napoleon launched an invasion of Russia in the summer of 1812. The resulting campaign witnessed the catastrophic retreat of Napoleon's Grande Armée. In 1813, Prussia and Austria joined Russian forces in a Sixth Coalition against France, resulting in a large coalition army defeating Napoleon at the Battle of Leipzig. The coalition invaded France and captured Paris, forcing Napoleon to abdicate in April 1814. He was exiled to the island of Elba, between Corsica and Italy. In France, the Bourbons were restored to power.
As you can see, chunkipy
created chunks smaller than the given input size,
512
, counted the tokens using BERT's tokenizer, and didn't cut sentences in half,
producing syntactically correct chunks of text.
You can also use TextChunker
's split_text()
method to split a text into smaller parts
without actually creating the chunks.
This can be useful, for example, when you want to apply further processing to
the text parts before creating the final chunks.
TextChunker
provides you with the overlap functionality.
The system aims to ensure that the last text parts of each segment do not exceed the maximum token limit defined for overlap.
For example, if the chunk_size
is set to 100 and the overlap_percentage
is 0.1, the maximum number of overlapping tokens is 10.
Consequently, if there are sentences or text parts that fit within this token limit, they are added at the beginning of the next segment.
If not, they are skipped to maintain a good ratio between overlap and content.
text_chunker = TextChunker(50, tokens=True, overlap_percent=0.3)
# Set up test input
text = "In this unit test, we are evaluating the overlapping functionality." \
"This is a feature of the TextChunker class, which is important for a proper context keeping. The " \
"goal is to ensure that overlapping chunks are generated correctly. For this purpose, we have chosen a " \
"long text that exceeds 100 tokens. By setting the overlap_percent to 0.3, we expect the " \
"generated chunks to have an overlap of approximately 30%. This will help us verify the effectiveness " \
"of the overlapping feature. The TextChunker class should be able to handle this scenario and " \
"produce the expected results. Let's proceed with running the test and asserting the generated chunks " \
"for proper overlap. "
# Generate chunks with overlapping
chunks = text_chunker.chunk(text)
# Print the resulting chunks
for i, chunk in enumerate(chunks):
print(f"Chunk {i + 1}: {chunk}")
This would output:
Chunk 1: In this unit test, we are evaluating the overlapping functionality. This is a feature of the TextChunker class, which is important for a proper context keeping. The goal is to ensure that overlapping chunks are generated correctly. For this purpose, we have chosen a long text that exceeds 100 tokens.
Chunk 2: For this purpose, we have chosen a long text that exceeds 100 tokens. By setting the overlap_percent to 0.3, we expect the generated chunks to have an overlap of approximately 30%. This will help us verify the effectiveness of the overlapping feature.
Chunk 3: This will help us verify the effectiveness of the overlapping feature. The TextChunker class should be able to handle this scenario and produce the expected results. Let's proceed with running the test and asserting the generated chunks for proper overlap.
If you don't want to use stanza
or you need to use your custom text splitting logic,
you can provide it in the constructor, as shown in this example:
from chunkipy import TextChunker
def split_by_arrow(text):
return [t for t in text.split("->") if t != '' and ' ']
text = "This is a tokenized text -> with custom split strategy."
# Create a TextChunker object with custom split strategy
text_chunker = TextChunker(chunk_size=8, tokens=True,
split_strategies=[split_by_arrow])
chunks = text_chunker.chunk(text)
# Print the resulting chunks
for i, chunk in enumerate(chunks):
print(f"Chunk {i+1}: {chunk}")
This would output:
Chunk 1: This is a tokenized text
Chunk 2: with custom split strategy.
If you find a bug or have a feature request, please open an issue on GitHub. Contributions are welcome! Just fork the repository, create a new branch with your changes, and submit a pull request. Please make sure to write tests for your changes and to follow the code style.
TextChunker is licensed under the MIT License.