segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Unusual splits in short sentence #90

Closed: rggdmonk closed this issue 1 year ago

rggdmonk commented 1 year ago

Hello, thank you for your great work!

I noticed unusual splits in a short sentence. I assume this is caused by the proper name ('Vaughn').

Is there any way to detect this?

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

issue = """‘Make sure it does,’ Vaughn said."""
expected = ["""‘Make sure it does,’ Vaughn said."""]

wtp.split(issue, lang_code="en")

# wrong ['‘Make sure it does,’ ', 'Vaughn ', 'said.']

wtp.split(issue, lang_code="en", style="ud")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue, lang_code="en", style="opus100")

# correct ['‘Make sure it does,’ Vaughn said.']

wtp.split(issue, lang_code="en", style="ersatz")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue, lang_code="en", threshold=0.99)

# correct ['‘Make sure it does,’ Vaughn said.']

Tested: version 1.0.1, Google Colab (CPU)

bminixhofer commented 1 year ago

Hi, thanks for the issue!

Is there any way to detect this?

Unfortunately no, not that I can think of.

A reason might be that very short sentences are a bit out-of-distribution for the model; it is trained on chunks of 512 characters each. You could try passing this sentence as part of a larger chunk of text and see how that influences the split probability.
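For example (a minimal sketch; the surrounding sentences are invented just to give the model more context):

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

short = "‘Make sure it does,’ Vaughn said."

# Embed the problematic sentence in a longer, made-up passage so the
# input looks more like the ~512-character chunks the model was trained on.
context = (
    "He glanced at the clock and frowned. "
    + short
    + " The room went quiet after that."
)

print(wtp.split(context, lang_code="en"))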

Besides that, increasing the threshold is an option. You can play around with the threshold not only for the 'regular' model but also for the model adapted to a particular style (there the default is usually ~0.5, but it's not exposed in the API). That may solve your issue.
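If you want to look at the raw probabilities the threshold is compared against, something like this should work (a sketch: the predict_proba name and per-character return value are assumptions here, so check the API of your installed version):

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")
text = "‘Make sure it does,’ Vaughn said."

# Assumed API: returns one boundary probability per input character.
probs = wtp.predict_proba(text, lang_code="en")

# Show the characters the model considers likely split points, and how
# far above a given threshold they sit.
for ch, p in zip(text, probs):
    if p > 0.1:
        print(repr(ch), round(float(p), 3))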

rggdmonk commented 1 year ago

A reason might be that very short sentences are a bit out-of-distribution for the model; it is trained on chunks of 512 characters each. You could try passing this sentence as part of a larger chunk of text and see how that influences the split probability.

Thanks, I will try.

Besides that, increasing the threshold is an option. You can play around with the threshold not only for the 'regular' model but also for the model adapted to a particular style (there the default is usually ~0.5, but it's not exposed in the API). That may solve your issue.

It doesn't seem to work for this short text, though? Changing the threshold has no effect once a style is set:

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

text = "‘Make sure it does,’ Vaughn said."

wtp.split(text, lang_code="en", style="ud")

wtp.split(text, lang_code="en", style="ud", threshold=0.01)

wtp.split(text, lang_code="en", style="ud", threshold=0.99)

wtp.split(text, lang_code="en", style="ud", threshold=1.0)

wtp.split(text, lang_code="en", style="ud", threshold=2.0)

wtp.split(text, lang_code="en", style="ud", threshold=0.9)

# same output ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(text, lang_code="en", style="ersatz")

wtp.split(text, lang_code="en", style="ersatz", threshold=0.9)

# same output ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(text, lang_code="en", style="opus100")

wtp.split(text, lang_code="en", style="opus100", threshold=0.0001)

# same output ['‘Make sure it does,’ Vaughn said.']

bminixhofer commented 1 year ago

Good catch: the user-supplied threshold was overwritten by the style default when using adaptation. It's fixed in v1.1.0!
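To verify after upgrading, a minimal sketch (no recorded output; with the fix, the value passed here should actually be used instead of the ~0.5 style default):

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")
text = "‘Make sure it does,’ Vaughn said."

# In v1.1.0 the user-supplied threshold is respected even when a
# style-adapted model is used.
print(wtp.split(text, lang_code="en", style="ud", threshold=0.99))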