The pySBD spaCy pipeline component takes a token-based approach: it sets is_sent_start to True or False on each token, depending on the Spans obtained from pySBD's character offsets. For each sentence slice doc.text[start:end], we pass the offsets start and end to the doc.char_span method to create a Span object, and the first Token of that Span needs its is_sent_start attribute set to True. However, if the character indices don't map to a valid span (that is, they do not align with token boundaries), doc.char_span returns None and no sentence start is set. This is why the sentence output of pySBD alone and of pySBD + spaCy can differ.
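The failure mode described above can be sketched without spaCy. The following is a simplified pure-Python model, not the component's actual code: tokens are plain (start, end) character offsets from a whitespace split, and the names whitespace_tokens, strict_char_span and mark_sentence_starts are hypothetical helpers introduced only for illustration. It assumes a strict lookup that returns None whenever an offset falls off a token boundary.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a token: (start_char, end_char), end exclusive.
Token = Tuple[int, int]


def whitespace_tokens(text: str) -> List[Token]:
    """Split on whitespace and record each token's character offsets."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        end = start + len(piece)
        tokens.append((start, end))
        pos = end
    return tokens


def strict_char_span(
    tokens: List[Token], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Return (first_token, end_token) indices, or None when the character
    offsets do not coincide with token boundaries."""
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start not in starts or end not in ends:
        return None
    first, last = starts[start], ends[end]
    if first > last:
        return None
    return first, last + 1


def mark_sentence_starts(
    tokens: List[Token], sentence_offsets: List[Tuple[int, int]]
) -> List[bool]:
    """Flag the first token of every sentence whose offsets map to a span."""
    is_sent_start = [False] * len(tokens)
    for start, end in sentence_offsets:
        span = strict_char_span(tokens, start, end)
        if span is not None:  # a None span silently loses the sentence start
            is_sent_start[span[0]] = True
    return is_sent_start


text = "Hello world. Bye now."
tokens = whitespace_tokens(text)
aligned = mark_sentence_starts(tokens, [(0, 12), (13, 21)])
# The first sentence's end offset (13) sits on whitespace, not a token end.
misaligned = mark_sentence_starts(tokens, [(0, 13), (13, 21)])
```

With aligned offsets both sentence starts are flagged; with the misaligned pair the first lookup yields None, so the first sentence start is never set.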
The inability to get a Span object from pySBD character offsets can be tackled by deconstructing the Doc object, in the way the authors of PKSHATechnology-Research/camphr have written get_doc_char_span, which uses destruct_token.
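The idea of deconstructing tokens so that any character range maps to a span can be sketched as follows. This is a minimal pure-Python model under stated assumptions, not camphr's implementation: tokens are (start, end) character offsets, and split_at and destructive_char_span are hypothetical names chosen for this illustration. Whenever a requested offset falls strictly inside a token, that token is split at the offset, so the range afterwards lines up with token boundaries.

```python
from typing import List, Tuple

# Hypothetical stand-in for a token: (start_char, end_char), end exclusive.
Token = Tuple[int, int]


def split_at(tokens: List[Token], offset: int) -> List[Token]:
    """Split the token that strictly contains `offset`, if there is one."""
    result: List[Token] = []
    for start, end in tokens:
        if start < offset < end:
            result.extend([(start, offset), (offset, end)])
        else:
            result.append((start, end))
    return result


def destructive_char_span(
    tokens: List[Token], start: int, end: int
) -> Tuple[List[Token], int, int]:
    """Split tokens at `start` and `end` where needed, then return the new
    token list plus the (first, end) token indices covering [start, end)."""
    tokens = split_at(split_at(tokens, start), end)
    inside = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
    if not inside:
        raise ValueError("no token lies inside the requested character range")
    return tokens, inside[0], inside[-1] + 1


# "It works.Then stops." split on whitespace: the sentence boundary at
# character 9 falls inside the token "works.Then", which spans (3, 13).
tokens = [(0, 2), (3, 13), (14, 20)]
tokens, first_a, end_a = destructive_char_span(tokens, 0, 9)
tokens, first_b, end_b = destructive_char_span(tokens, 9, 20)
```

After the first call the token (3, 13) has been split into (3, 9) and (9, 13), so both sentences map to token ranges and the first token of each can be marked as a sentence start. The trade-off in this sketch is that the tokenization itself changes, which may matter to components that run later in the pipeline.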