nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box.
MIT License

✨Add `char_span` functionality #40

Closed: nipunsadvilkar closed this 5 years ago

nipunsadvilkar commented 5 years ago

The optional `char_span` parameter makes the segmenter return `TextSpan` objects instead of plain strings. Each `TextSpan` holds the sentence string (`sent`) and the sentence's start and end character offsets (`start` and `end`, both `int`) within the original text.

Example:

import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
print(seg.segment(text))
# [TextSpan(sent='My name is Jonas E. Smith.', start=0, end=26),
# TextSpan(sent='Please turn to p. 55.', start=27, end=48)]
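
The offsets in the example read as half-open ranges (`end` is exclusive), so slicing the original text with them should give back each sentence. The sketch below checks that reading without running pysbd. Its `TextSpan` is a hypothetical stand-in built with `NamedTuple`, not the library's own class, and the span values are copied from the output above.

```python
from typing import NamedTuple


class TextSpan(NamedTuple):
    """Hypothetical stand-in with the fields shown in the example output."""
    sent: str
    start: int
    end: int


text = "My name is Jonas E. Smith. Please turn to p. 55."

# Values copied from the example output; not produced by calling pysbd here.
spans = [
    TextSpan(sent="My name is Jonas E. Smith.", start=0, end=26),
    TextSpan(sent="Please turn to p. 55.", start=27, end=48),
]


def spans_match_text(text, spans):
    """Return True if slicing text[start:end] reproduces every sentence."""
    return all(text[s.start:s.end] == s.sent for s in spans)


print(spans_match_text(text, spans))  # True
```

The space between the two sentences (index 26) belongs to neither span, which is why the second span starts at 27 rather than 26.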