segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

The split does not look right for this particular case. #18

Closed Navaneethsen closed 3 years ago

Navaneethsen commented 3 years ago

Hi,

My sentence is as shown below:

What's working and what needs to change? Not everybody Dr.Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day. Yeah, but it's such an important exercise that they needed to do. Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..

When I split it using nnsplit the split sentences are shown below:

I don't think this is right. Could you please let me know whether these splits can be improved?

bminixhofer commented 3 years ago

Hi, the splitter will generally make some mistakes because it is a statistical model. For this specific text, you can fix the mistake by increasing the threshold (the default is 0.8):

In [1]: from nnsplit import NNSplit

In [2]: splitter = NNSplit.load("en", threshold=0.99)

In [3]: splits = splitter.split(["What's working and what needs to change? Not everybody Dr.Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day. Yeah, but it's such an important exercise that they needed to do. Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates.."])[0]

In [4]: [str(x) for x in splits]
Out[4]:
["What's working and what needs to change? ",
 "Not everybody Dr.Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day. ",
 "Yeah, but it's such an important exercise that they needed to do. ",
 'Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..']

However, with a higher threshold it will miss some other splits, especially where punctuation is missing.

That said, I am not entirely satisfied with the quality of the current model; I'll try some things to improve it.
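
To make the threshold trade-off concrete, here is a toy sketch. It is not nnsplit's actual implementation: the helper `split_with_threshold`, the candidate chunks, and the boundary probabilities are all made up for illustration. It only assumes that the model assigns each candidate boundary a probability and that a split is made where that probability reaches the threshold.

```python
# Toy illustration of thresholding split probabilities.
# NOT nnsplit's implementation; the chunks and probabilities are invented.

def split_with_threshold(candidates, threshold):
    """Merge chunks until a boundary's probability reaches the threshold.

    candidates: list of (chunk, probability that a sentence ends after chunk).
    """
    sentences, current = [], ""
    for chunk, prob in candidates:
        current += chunk
        if prob >= threshold:
            sentences.append(current)
            current = ""
    if current:
        # Whatever is left after the last accepted boundary is the final piece.
        sentences.append(current)
    return sentences


# Hypothetical model output for "Ask Dr. Jones. yes ok now go":
candidates = [
    ("Ask Dr. ", 0.85),   # spurious boundary after the abbreviation
    ("Jones. ", 0.995),   # confident, correct boundary
    ("yes ok ", 0.90),    # real boundary, but no punctuation to support it
    ("now go", 0.0),      # end of text
]

print(split_with_threshold(candidates, 0.8))
print(split_with_threshold(candidates, 0.99))
```

With the lower threshold the spurious split after "Dr." is kept. With the higher threshold it disappears, but so does the unpunctuated split after "yes ok".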