gioelecrispo / chunkipy

chunkipy is an extremely useful tool for segmenting long texts into smaller chunks, based on either a character or token count. With customizable chunk sizes and splitting strategies, chunkipy provides flexibility and control for various text processing tasks.
MIT License
33 stars 0 forks

When the chunk size is low, sentences are broken abruptly #5

Closed Ajaytherala closed 2 months ago

Ajaytherala commented 2 months ago

[image]

gioelecrispo commented 2 months ago

Hello @Ajaytherala, thank you for using chunkipy. The behaviour you are seeing is expected. As described in the documentation, it happens when the sentence segmenter cannot split the text into sentences of fewer than chunk_size tokens.

From README.md:

By default, chunkipy uses stanza as its main text splitting method; however, if stanza produces sentences with more tokens than the chunk size, other split strategies are used. Here is the list of predefined strategies, sorted by priority (the first one is executed first; if a resulting piece of text is still larger than the chunk size, it is further split using a lower-priority strategy).

If this behaviour does not suit your needs, you can provide your own split strategies. Closing as this is not a bug.
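
For readers who want to see why a small chunk size cuts sentences, here is a minimal sketch of a priority-ordered fallback like the one the README describes. It is not chunkipy's code or API: every name in it (count_tokens, split_sentences, split_on_clauses, split_on_words, split_with_fallback, pack) is hypothetical, a naive regex stands in for stanza, and a token is assumed to be a whitespace-separated word.

```python
import re
from typing import Callable, List

# Hypothetical type: a split strategy maps a text to a list of smaller pieces.
SplitStrategy = Callable[[str], List[str]]


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    # Naive stand-in for a real sentence segmenter such as stanza.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_on_clauses(text: str) -> List[str]:
    # Lower priority: break after commas, semicolons and colons.
    return [s for s in re.split(r"(?<=[,;:])\s+", text) if s]


def split_on_words(text: str) -> List[str]:
    # Last resort: individual words. This is where sentences get cut.
    return text.split()


def split_with_fallback(
    text: str, chunk_size: int, strategies: List[SplitStrategy]
) -> List[str]:
    # Keep a piece as soon as it fits; otherwise apply the highest-priority
    # strategy and recurse on each result with the remaining strategies.
    if count_tokens(text) <= chunk_size or not strategies:
        return [text]
    first, *rest = strategies
    pieces: List[str] = []
    for piece in first(text):
        pieces.extend(split_with_fallback(piece, chunk_size, rest))
    return pieces


def pack(pieces: List[str], chunk_size: int) -> List[str]:
    # Greedily merge consecutive pieces into chunks of at most chunk_size tokens.
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in pieces:
        n = count_tokens(piece)
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


STRATEGIES = [split_sentences, split_on_clauses, split_on_words]
TEXT = "The quick brown fox jumps over the lazy dog, then it rests. Short one."

small = pack(split_with_fallback(TEXT, 5, STRATEGIES), 5)
large = pack(split_with_fallback(TEXT, 12, STRATEGIES), 12)
print(small)
print(large)
```

In this sketch, a chunk size of 12 keeps both sentences whole, while a chunk size of 5 forces the first sentence down to the word-level strategy, so it is cut in the middle of a clause. Adding a gentler strategy ahead of the word-level one, or raising the chunk size, reduces how often that happens.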