nipunsadvilkar / pySBD

🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

Handle irregularities between pySBD & pySBD + spaCy sentence output #59

Closed nipunsadvilkar closed 4 years ago

nipunsadvilkar commented 4 years ago

The pySBD spaCy pipeline component uses a token-based approach: it sets is_sent_start to True or False on tokens, depending on the Spans obtained from pySBD character offsets. For each sentence we create a Span with the doc.char_span method, which corresponds to the slice doc.text[start:end], and the first Token of that Span needs to have its is_sent_start attribute set to True. However, if the character indices don't map to a valid span (that is, they don't fall on token boundaries), doc.char_span returns None and no sentence start can be set for that sentence. Hence we get irregularities between pySBD and pySBD + spaCy sentence output.
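To make the failure mode concrete, here is a minimal pure-Python sketch of the strict alignment check. It does not call spaCy: tokens are modelled as (start_char, end_char) pairs, and whitespace_tokens and strict_char_span are hypothetical helpers written for illustration only, assuming that a span is valid only when both offsets land exactly on token boundaries.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[int, int]  # (start_char, end_char), end exclusive


def whitespace_tokens(text: str) -> List[Token]:
    """Toy tokenizer (an assumption, not spaCy's): one token per run of non-whitespace."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def strict_char_span(tokens: List[Token], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Return the token index range [i, j) covering text[start:end],
    or None when either offset is not on a token boundary."""
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start not in starts or end not in ends:
        return None
    i, j = starts[start], ends[end] + 1
    return (i, j) if i < j else None


text = "Hello world.Next one."
tokens = whitespace_tokens(text)   # "world.Next" stays a single token
sentences = [(0, 12), (12, 21)]    # character offsets a segmenter might report

for start, end in sentences:
    # Both lookups fail: offset 12 falls in the middle of "world.Next".
    print(repr(text[start:end]), "->", strict_char_span(tokens, start, end))
```

Both sentence offsets are correct at the character level, yet neither maps to a token range, so no token can be flagged as a sentence start and the token-based output drifts from the plain pySBD output.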

The inability to get a Span object from pySBD character offsets can be tackled by deconstructing the Doc object, the way the PKSHATechnology-Research/camphr authors do in get_doc_char_span, which uses destruct_token.
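A rough sketch of that idea, again in pure Python rather than spaCy or camphr code: assuming the deconstruction amounts to splitting any token that straddles a requested character offset, the sentence offsets become valid token boundaries afterwards. The helpers split_tokens_at and sentence_start_flags are hypothetical names for illustration, not camphr's implementation.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (start_char, end_char), end exclusive


def split_tokens_at(tokens: List[Token], positions: List[int]) -> List[Token]:
    """Split every token that strictly contains one of the character positions."""
    out: List[Token] = []
    for start, end in tokens:
        cuts = sorted(p for p in set(positions) if start < p < end)
        bounds = [start, *cuts, end]
        out.extend(zip(bounds, bounds[1:]))
    return out


def sentence_start_flags(tokens: List[Token], sentences: List[Tuple[int, int]]) -> List[bool]:
    """True for each token whose first character begins a sentence."""
    starts = {s for s, _ in sentences}
    return [tok_start in starts for tok_start, _ in tokens]


# "Hello world.Next one." tokenized on whitespace; "world.Next" is one token.
tokens = [(0, 5), (6, 16), (17, 21)]
sentences = [(0, 12), (12, 21)]  # character offsets reported by the segmenter

positions = [p for span in sentences for p in span]
aligned = split_tokens_at(tokens, positions)
flags = sentence_start_flags(aligned, sentences)

print(aligned)  # [(0, 5), (6, 12), (12, 16), (17, 21)]
print(flags)    # [True, False, True, False]
```

The trade-off is that the tokenization itself changes: "world.Next" becomes two tokens, which is what allows every sentence to receive a start flag.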

nipunsadvilkar commented 4 years ago

Fixed #63