nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

Examples of modifying sentence segmentation rules. #108

Closed delvinso closed 2 years ago

delvinso commented 2 years ago

Hi, apologies if this is already documented. I've looked through current and past issues, and the only reference I could find is #90, but it doesn't seem to contain an explanation. For reference, this is the original issue:

I have the following piece of text which I feed to pysbd.Segmenter:

'Trying to get back to Com. & Adm. through the most direct path in the dark.'

The correct way of handling this text is to keep it as a single sentence, although the segment() method returns:

['Trying to get back to Com.',
 '& Adm.',
 'through the most direct path in the dark.']

How do I tell the segmenter to avoid splitting a sentence at specific abbreviations, in this case "Com." and "Adm."? The poster in the README file states that rules are "easy to modify", so how do I do that?

Are there any examples of how to modify the current rules in place? I'm looking to use this for clinical text and it seems to offer improvements over another, default implementation of sentence segmentation, particularly when it comes to handling lists.

Thanks!

nipunsadvilkar commented 2 years ago

Hey @delvinso, thanks for using pySBD.

Unfortunately, there is no specific documentation about modifying rules. There are many of them, and each rule applies some transformation whose output is taken as input by other rules.

To illustrate it further:

https://github.com/nipunsadvilkar/pySBD/blob/5905f13be4fc95f407b98392e0ec303617a33d86/pysbd/processor.py#L32-L37

As you can see above, all of those operations need to be performed in that sequence because they are interrelated. This structure is a design decision of https://github.com/diasks2/pragmatic_segmenter; I just ported it from Ruby to Python.

The way to tackle your edge case is to dive into the source code and see where your sentence is being segmented incorrectly.
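Short of editing the rules themselves, one workaround that leaves the library untouched is to mask the periods of known abbreviations before segmentation and restore them afterwards. The sketch below is not pySBD's own mechanism; `segment_protected`, `PLACEHOLDER`, and the choice of placeholder character are hypothetical names introduced for illustration, and whether a given segmenter leaves the placeholder alone should be checked on your own data.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for "." inside protected abbreviations
# (U+2024 ONE DOT LEADER). Assumes the input text never contains it,
# because every occurrence is turned back into "." at the end.
PLACEHOLDER = "\u2024"


def segment_protected(
    text: str,
    segment: Callable[[str], List[str]],
    abbreviations: Iterable[str],
) -> List[str]:
    """Segment `text` while keeping the given abbreviations intact.

    `segment` is any callable that maps a string to a list of sentence
    strings, for example the `segment` method of a `pysbd.Segmenter`.
    """
    abbrevs = sorted(abbreviations, key=len, reverse=True)  # longest first
    if not abbrevs:
        return segment(text)

    # \b keeps "Com." from matching inside a longer word such as "Telecom.".
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in abbrevs) + ")")

    # Hide the periods of each matched abbreviation from the segmenter.
    masked = pattern.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)

    # Segment the masked text, then put the real periods back.
    return [s.replace(PLACEHOLDER, ".") for s in segment(masked)]


# Usage with pySBD (requires `pip install pysbd`):
#   import pysbd
#   seg = pysbd.Segmenter(language="en", clean=False)
#   segment_protected(text, seg.segment, ["Com.", "Adm."])
```

The helper assumes the segmenter returns plain strings; with `char_span=True` pySBD returns span objects instead, and the restore step would need adapting.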

Pro tip:

The best way is to use the Python debugger and watch how your input text passes through the different transformations on its way to the final sentences.
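As a non-interactive complement to stepping through the code, the sketch below logs every Python-level function call made inside a package while one call runs, which gives a rough map of where to set breakpoints. It uses only the standard library's `sys.settrace`; `trace_calls` and `module_prefix` are hypothetical names, not part of pySBD.

```python
import sys
from typing import Any, Callable, List, Tuple


def trace_calls(
    func: Callable[..., Any],
    *args: Any,
    module_prefix: str = "pysbd",
    **kwargs: Any,
) -> Tuple[Any, List[str]]:
    """Run `func(*args, **kwargs)` and record which functions it enters.

    Only Python functions defined in modules whose name starts with
    `module_prefix` are recorded; calls into C code (such as `re`
    internals) do not show up.
    """
    calls: List[str] = []

    def tracer(frame, event, arg):
        if event == "call":
            module = frame.f_globals.get("__name__", "")
            if module.startswith(module_prefix):
                calls.append(f"{module}.{frame.f_code.co_name}")
        return None  # no line-by-line tracing inside the frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier tracer
    return result, calls


# Usage with pySBD (requires `pip install pysbd`):
#   import pysbd
#   seg = pysbd.Segmenter(language="en", clean=False)
#   sentences, calls = trace_calls(seg.segment, "Com. & Adm. through the dark.")
#   print("\n".join(calls))
```

To step through interactively instead, the standard library's `pdb.runcall(seg.segment, text)` starts the debugger at the first line of the call, and printing the intermediate text after each step shows which transformation introduces the unwanted split.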

Let me know if this helps.

nipunsadvilkar commented 2 years ago

Closing the issue as there is no specific documentation for this.