Closed rggdmonk closed 1 year ago
Hi, thanks for the issue!
> Is there any way to detect this?
Unfortunately no, not that I could think of.
One reason might be that very short sentences are somewhat out-of-distribution for the model, since it is trained on chunks of 512 characters each. You could try passing this sentence as part of a larger chunk of text and see how that influences the split probability.
Besides that, increasing the threshold is an option. You can play around with the threshold not only for the 'regular' model but also for the model adapted to a particular style (there the default is usually ~0.5, but it's not exposed in the API). That may solve your issue.
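The suggestion of embedding the short sentence in a larger chunk can be checked mechanically. The following is only a rough sketch: `splits_inside` is a hypothetical helper, not part of wtpsplit, and it assumes the returned segments concatenate back to the input text (which matches the outputs shown in this thread, where the trailing space is kept).

```python
from typing import Callable, List


def splits_inside(
    split_fn: Callable[[str], List[str]],
    sentence: str,
    before: str = "",
    after: str = "",
) -> bool:
    """Return True if split_fn puts a boundary strictly inside `sentence`.

    `split_fn` is any callable mapping a text to a list of segments.
    Assumption: "".join(segments) reproduces the input text exactly.
    """
    text = before + sentence + after
    start = len(before)           # offset where the sentence begins
    end = start + len(sentence)   # offset just past the sentence
    pos = 0
    # The end of every segment except the last one is a split point.
    for segment in split_fn(text)[:-1]:
        pos += len(segment)
        if start < pos < end:
            return True
    return False


# Possible usage with the model from this thread (not run here):
# from wtpsplit import WtP
# wtp = WtP("wtp-canine-s-12l")
# split_ud = lambda t: wtp.split(t, lang_code="en", style="ud")
# sentence = "‘Make sure it does,’ Vaughn said."
# splits_inside(split_ud, sentence)                           # sentence alone
# splits_inside(split_ud, sentence, before="<more text> ")    # with context
```

Comparing the result with and without surrounding context gives a quick indication of whether the short input length is what triggers the extra split.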
> One reason might be that very short sentences are somewhat out-of-distribution for the model, since it is trained on chunks of 512 characters each. You could try passing this sentence as part of a larger chunk of text and see how that influences the split probability.
Thanks, I will try that.
> Besides that, increasing the threshold is an option. You can play around with the threshold not only for the 'regular' model but also for the model adapted to a particular style (there the default is usually ~0.5, but it's not exposed in the API). That may solve your issue.
It does not seem to work for short sentences?
from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")
text = "‘Make sure it does,’ Vaughn said."

# style="ud": the threshold has no effect on the result
wtp.split(text, lang_code="en", style="ud")
wtp.split(text, lang_code="en", style="ud", threshold=0.01)
wtp.split(text, lang_code="en", style="ud", threshold=0.9)
wtp.split(text, lang_code="en", style="ud", threshold=0.99)
wtp.split(text, lang_code="en", style="ud", threshold=1.0)
wtp.split(text, lang_code="en", style="ud", threshold=2.0)
# same output ['‘Make sure it does,’ ', 'Vaughn said.']

# style="ersatz": same behaviour
wtp.split(text, lang_code="en", style="ersatz")
wtp.split(text, lang_code="en", style="ersatz", threshold=0.9)
# same output ['‘Make sure it does,’ ', 'Vaughn said.']

# style="opus100": no split, even with a very low threshold
wtp.split(text, lang_code="en", style="opus100")
wtp.split(text, lang_code="en", style="opus100", threshold=0.0001)
# same output ['‘Make sure it does,’ Vaughn said.']
Good catch! The threshold you passed was being overwritten by the default threshold when using adaptation. It's fixed in v1.1.0!
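To check whether a passed threshold is actually honoured, the manual calls above can be turned into a small sweep. This is only a sketch: `threshold_has_effect` is a hypothetical helper, not part of wtpsplit, and it takes any callable of the form `split_fn(text, threshold)` so that it can wrap the `wtp.split(..., threshold=...)` call used in this thread.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def threshold_has_effect(
    split_fn: Callable[[str, float], List[str]],
    text: str,
    thresholds: Sequence[float],
) -> Tuple[bool, Dict[float, List[str]]]:
    """Run split_fn once per threshold and report whether the outputs differ.

    Returns (changed, results), where `results` maps each threshold to the
    segments produced for it and `changed` is True if at least two
    thresholds led to different segmentations.
    """
    results = {t: split_fn(text, t) for t in thresholds}
    # Lists are not hashable, so compare the outputs as tuples.
    distinct = {tuple(segments) for segments in results.values()}
    return len(distinct) > 1, results


# Possible usage with the model from this thread (not run here):
# from wtpsplit import WtP
# wtp = WtP("wtp-canine-s-12l")
# split_ud = lambda s, t: wtp.split(s, lang_code="en", style="ud", threshold=t)
# changed, results = threshold_has_effect(
#     split_ud, "‘Make sure it does,’ Vaughn said.", [0.01, 0.5, 0.9, 0.99]
# )
```

If `changed` stays False across a wide range of thresholds, as in the outputs reported for version 1.0.1, the threshold is probably being ignored rather than merely being too low or too high.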
Hello, thank you for your great work!
I noticed unusual splits in a short sentence. I assume this is due to the name.
Is there any way to detect this?
Tested with version 1.0.1 on Colab (CPU).