nipunsadvilkar / pySBD

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detector that works out of the box.
MIT License

Arabic sentence split on the Arabic comma #113

Open ymoslem opened 2 years ago

ymoslem commented 2 years ago

Describe the bug
The Arabic segmenter splits a sentence at the Arabic comma (،).

To Reproduce
Steps to reproduce the behavior:

import pysbd
text = "هذه تجربة، للغة العربية"
seg = pysbd.Segmenter(language="ar", clean=True)
print(seg.segment(text))

Output: ['هذه تجربة،', 'للغة العربية']

Expected behavior
The text should not be split at the Arabic comma.
Expected output: ['هذه تجربة، للغة العربية']

Additional context
I fixed this locally by editing pysbd/lang/arabic.py and removing ، from SENTENCE_BOUNDARY_REGEX.
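
To show why removing the comma from the boundary pattern changes the result, here is a minimal standalone sketch using only the standard `re` module. The two patterns and the `split_sentences` helper are hypothetical and written for illustration; they are not copied from pysbd, and the actual SENTENCE_BOUNDARY_REGEX in pysbd/lang/arabic.py may look different.

```python
import re

# Hypothetical boundary patterns, for illustration only. The real
# SENTENCE_BOUNDARY_REGEX in pysbd/lang/arabic.py may differ.
WITH_COMMA = re.compile(r'.*?[:\.!\?؟،]|.*?\Z')     # treats ، as a boundary
WITHOUT_COMMA = re.compile(r'.*?[:\.!\?؟]|.*?\Z')   # ، removed from the class


def split_sentences(text, pattern):
    """Return the non-empty, stripped matches of `pattern` in `text`."""
    return [m.strip() for m in pattern.findall(text) if m.strip()]


text = "هذه تجربة، للغة العربية"

with_comma = split_sentences(text, WITH_COMMA)
without_comma = split_sentences(text, WITHOUT_COMMA)

print(len(with_comma), with_comma)        # two segments, split at the comma
print(len(without_comma), without_comma)  # one segment, the whole text
```

With the comma in the character class, the sample text comes back as two segments, which mirrors the reported output. Without it, the text stays in one piece. Whether the same change is safe for every other Arabic rule in pysbd is not something this sketch checks.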