Ajaytherala closed this 2 months ago
Hello @Ajaytherala, thank you for using chunkipy.
The behaviour you are seeing is expected. As described in the documentation, it occurs when the sentence segmenter cannot split the text into sentences that fit within chunk_size tokens.
From README.md:
By default, chunkipy uses stanza as the main text splitting method; however, if stanza produces sentences with a number of tokens greater than the chunk size, other split strategies are used. Here is the list of predefined strategies, sorted by priority (the first one is executed first; if the piece of text is larger than the chunk size, it is further split using a lower-priority strategy).
If this behaviour does not suit your needs, you can provide your own split strategies. Closing as this is not a bug.
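To make the fallback described in the README concrete, here is a minimal plain-Python sketch of priority-ordered splitting. It does not use chunkipy's API: every name in it (count_tokens, split_on_sentences, split_on_commas, split_on_words, split_recursively) is hypothetical, whitespace-separated words stand in for tokens, and a naive regex stands in for stanza.

```python
import re
from typing import Callable, List

# All names here are hypothetical illustrations, not chunkipy's API.
SplitStrategy = Callable[[str], List[str]]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens.
    return len(text.split())


def split_on_sentences(text: str) -> List[str]:
    # Naive stand-in for a sentence segmenter such as stanza.
    return [p for p in re.split(r"(?<=[.!?])\s+", text) if p]


def split_on_commas(text: str) -> List[str]:
    # Lower-priority strategy: break after each comma.
    return [p for p in re.split(r"(?<=,)\s+", text) if p]


def split_on_words(text: str) -> List[str]:
    # Last resort: one piece per word.
    return text.split()


def split_recursively(
    text: str, chunk_size: int, strategies: List[SplitStrategy]
) -> List[str]:
    # A piece that already fits is kept whole; so is an oversized piece
    # once no strategies remain.
    if count_tokens(text) <= chunk_size or not strategies:
        return [text]
    first, *rest = strategies
    pieces: List[str] = []
    for piece in first(text):
        # Pieces that are still too large fall through to the
        # next (lower-priority) strategy.
        pieces.extend(split_recursively(piece, chunk_size, rest))
    return pieces


if __name__ == "__main__":
    text = "One two three. Four five, six seven eight nine, ten."
    strategies = [split_on_sentences, split_on_commas, split_on_words]
    for piece in split_recursively(text, 3, strategies):
        print(repr(piece))
```

In this sketch the first sentence fits within the limit and stays whole, while the second is too long and is broken up by the comma and word strategies. That is the kind of output you get when the segmenter alone cannot produce small enough sentences.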