segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Single words incorrectly segmented into character sequences #127

Closed lhcoder closed 2 months ago

lhcoder commented 2 months ago

@bminixhofer Hello, I've encountered an issue within the code at https://github.com/segment-any-text/wtpsplit/blob/0f675f7598376d8739cba8ac7cabd23d7bdef52f/wtpsplit/utils/__init__.py#L452

When processing texts like "Abstract", "Hello", and "Hello Hello", the predicted token_logits values are consistently high. As a result, it tends to split these words into individual character sequences, e.g., ["H", "e", "l", "l", "o"], rather than keeping them as whole tokens.

Perhaps it would be more effective to replace the use of np.min(token_logits) with a fixed smaller value (e.g., -10) to prevent this overly aggressive splitting.
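The effect of the proposed change can be illustrated with a standalone NumPy sketch (this is a simplified model of the padding logic, not the actual wtpsplit code; `pad_logits` and the threshold of 0.5 are illustrative). When every observed logit is already high, padding with `np.min(token_logits)` produces padded scores that are also high, so every position can clear the split threshold; a fixed floor like -10 keeps padded positions safely below it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pad_logits(token_logits, target_len, fill="min"):
    """Pad a 1-D array of boundary logits to target_len.

    fill="min" mimics padding with np.min(token_logits): if all
    observed logits are high, the padding is high too.
    fill="fixed" uses a constant floor (-10) instead, so padded
    positions never look like sentence boundaries.
    """
    pad_value = token_logits.min() if fill == "min" else -10.0
    pad = np.full(target_len - len(token_logits), pad_value)
    return np.concatenate([token_logits, pad])

# Uniformly high logits, as reported for short inputs like "Hello":
logits = np.array([4.0, 3.5, 4.2, 3.8, 4.1])

# With min-padding, every position (including padding) clears 0.5:
print((sigmoid(pad_logits(logits, 8, fill="min")) > 0.5).all())    # True

# With a fixed floor, the padded positions stay below the threshold:
print((sigmoid(pad_logits(logits, 8, fill="fixed"))[-3:] < 0.5).all())  # True
```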

markus583 commented 2 months ago

Hi, which model are you using so you are getting this behavior?

lhcoder commented 2 months ago

sat-12l-sm

markus583 commented 2 months ago

I see, thanks for the info. I'm not sure that replacing np.min(token_logits) would be a sustainable solution; it may have undesired consequences. In general, such very short sequences are uncommon inputs for segmentation, so I assume they are somewhat out of domain. In contrast, "Abstract Abstract." works just fine.

To work around this, you could create a simple filter depending on string length. Another solution would be to use non-SM models, e.g. sat-12l. SM models have been fine-tuned on a diverse set of sentences, so it is not surprising that sequences that don't resemble sentences at all fall out of domain. The pre-training objective of sat-12l, however, has been more diverse, so it should be more robust, handling short sequences as well as sentences and paragraphs. Indeed, in my tests, it split your test cases just fine, e.g.:

```python
>>> sat.split("Hello Hello")
['Hello Hello']
```
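A minimal version of the length-based filter suggested above could look like this (a sketch: the `splitter` callable, the helper name, and the 20-character cutoff are all illustrative choices, not part of the wtpsplit API):

```python
def split_with_min_length(splitter, text, min_chars=20):
    """Return short inputs whole instead of delegating to the model.

    `splitter` is any callable that maps a string to a list of
    segments, e.g. a bound SaT.split method. Texts shorter than
    min_chars are assumed to be a single unit.
    """
    if len(text) < min_chars:
        return [text]
    return list(splitter(text))

# Usage with a stand-in splitter (a real one would be sat.split):
naive = lambda t: t.split(". ")
print(split_with_min_length(naive, "Hello"))  # ['Hello']
print(split_with_min_length(naive, "First one. Second one here."))
```

This keeps the model untouched and simply sidesteps the out-of-domain regime for very short strings.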
markus583 commented 2 months ago

Checked again, and it should actually not be a problem! I pushed a fix in v2.0.8. Thanks for both raising this and providing a solution! I will close the issue.