nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detector that works out of the box.
MIT License

How to modify segmentation rules by hand? #90

Closed: anferico closed this issue 3 years ago

anferico commented 3 years ago

I have the following piece of text which I feed to pysbd.Segmenter:

'Trying to get back to Com. & Adm. through the most direct path in the dark.'

This text should be kept as a single sentence, but the segment() method splits it into three:

['Trying to get back to Com.',
 '& Adm.',
 'through the most direct path in the dark.']

How do I tell the segmenter not to split a sentence at specific abbreviations, in this case "Com." and "Adm."? The poster in the README file states that rules are "easy to modify", so how do I do that?

nipunsadvilkar commented 3 years ago

You would need to tweak the Abbreviation class and add your own abbreviations to it.
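For anyone landing here, a minimal sketch of what that could look like. It rests on assumptions that should be checked against the installed pysbd source: that the English language class is importable as `pysbd.lang.english.English`, that it exposes an `Abbreviation` class whose `ABBREVIATIONS` attribute is a plain list, and that entries are stored lowercase without the trailing period. The helper names `add_abbreviations` and `segment_with_extra_abbreviations` are made up for this example and are not part of pysbd.

```python
def add_abbreviations(abbreviation_cls, new_abbreviations):
    """Append abbreviations to abbreviation_cls.ABBREVIATIONS in place.

    Entries are normalized to the assumed storage format (lowercase, no
    trailing period). Returns the entries that were actually added.
    """
    added = []
    for raw in new_abbreviations:
        abbrev = raw.strip().rstrip(".").lower()
        # Skip empty strings and entries that are already registered.
        if abbrev and abbrev not in abbreviation_cls.ABBREVIATIONS:
            abbreviation_cls.ABBREVIATIONS.append(abbrev)
            added.append(abbrev)
    return added


def segment_with_extra_abbreviations(text, extra, language="en"):
    """Hypothetical wrapper: register extra abbreviations, then segment."""
    # Imported lazily so the helper above can be used without pysbd installed.
    import pysbd
    from pysbd.lang.english import English  # assumed import path

    # Assumption: the segmenter reads this list each time segment() is called.
    add_abbreviations(English.Abbreviation, extra)
    return pysbd.Segmenter(language=language, clean=False).segment(text)
```

Calling `segment_with_extra_abbreviations(text, ["Com.", "Adm."])` on the sentence from the original post should show whether the split at "Com." goes away. Note that this mutates a class-level list, so the change applies to every segmenter in the process, and possibly to other languages that share the same list.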

ywang4 commented 3 years ago

Hello, could you provide a more detailed tutorial on how to add custom rules? Thank you!

garyhsu29 commented 3 years ago

Same question here. Does anyone know how to add a rule to Abbreviation?