nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.
MIT License

Incorrect text span start and end returned #49

Closed dakinggg closed 4 years ago

dakinggg commented 4 years ago

Looks like something weird is happening in this case. Note that the indices of the second text span are incorrect:

>>> import pysbd
>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)]

nipunsadvilkar commented 4 years ago

@danielkingai2 The character offset issue in #49 and #53 is due to \r not being handled in sentence_boundary_punctuation, so the correct TextSpan is not returned.

🚧WIP branch fix can be found here:

https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix

https://github.com/nipunsadvilkar/pySBD/blob/995af9e3c3d324aa053cb06f79c0038c882f633d/pysbd/processor.py#L81

What I have been trying to do is add an identifier at each point where a sentence should be split. Though the segmented sentences are correct, the offsets get shifted by the added identifiers. The next challenge is that spaCy's Doc.char_span returns None if I include whitespace in the character offsets - https://github.com/explosion/spaCy/issues/2637. Example:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)

>>> text[0:17]
# 'a. The first item'

>>> text[0:18]
# 'a. The first item ' # Note whitespace here

>>> doc.char_span(0, 17) # without whitespace offset
# 'a. The first item'

>>> doc.char_span(0, 18) # with whitespace offset
# None

It would be nice if you could also think of a workaround or build on top of the WIP branch.
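
One possible downstream workaround, sketched here under stated assumptions: ignore the offsets returned by the segmenter and recompute them by locating each sentence in the original text, scanning left to right. `Span` and `realign` are hypothetical names for illustration and are not part of pySBD; the sketch also assumes each segmented sentence appears verbatim in the input (that is, `clean=False`).

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a (sent, start, end) text span."""
    sent: str
    start: int
    end: int


def realign(text: str, sentences: List[str]) -> List[Span]:
    """Recompute character offsets by finding each sentence in `text`."""
    spans = []
    cursor = 0  # never search before the end of the previous sentence
    for sent in sentences:
        core = sent.strip()  # offsets exclude surrounding whitespace
        start = text.find(core, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in text: {sent!r}")
        end = start + len(core)
        spans.append(Span(core, start, end))
        cursor = end
    return spans


text = "1) The first item. 2) The second item."
spans = realign(text, ["1) The first item.", "2) The second item."])
print(spans)
# [Span(sent='1) The first item.', start=0, end=18),
#  Span(sent='2) The second item.', start=19, end=38)]
```

Moving the cursor forward after each match keeps repeated sentences from all mapping to the first occurrence.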

dakinggg commented 4 years ago

With respect to the whitespace thing, can't you just take the character offsets after trimming trailing whitespace or something? Like, if the text ends in whitespace, subtract one from the character offset or something.

dakinggg commented 4 years ago

Ah, I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase, is it possible to just check the text spans you are returning, and if they start or end with whitespace, edit the start/end indices appropriately? Maybe it's OK if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
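
A minimal sketch of that suggestion, assuming spans are half-open `[start, end)` character offsets into the original text. `trim_span` is a hypothetical helper, not a pySBD function.

```python
from typing import Tuple


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink [start, end) so it excludes leading and trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "a. The first item b. The second item c. The third list item"
raw = [(0, 18), (18, 37), (37, 59)]  # spans that include trailing whitespace
trimmed = [trim_span(text, s, e) for s, e in raw]
print(trimmed)  # [(0, 17), (18, 36), (37, 59)]
```

The trimmed offsets end on token boundaries, which is the form `Doc.char_span` accepted in the spaCy example earlier in the thread.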

nipunsadvilkar commented 4 years ago

Yes, a possible solution should account for the following:

Original input - a. The first item b. The second item c. The third list item

Preprocessed - ❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item

Non-strict pattern to get the text within - r'[^❦]+'

"a∯ The first item ", 1, 19
"b∯ The second item ", 20, 39
"c∯ The third list item", 40, 62

Actual spans with whitespace:

"a∯ The first item ", 0, 18
"b∯ The second item ", 18, 37
"c∯ The third list item", 37, 59

Spans needed for spaCy char_span, without trailing whitespace:

"a∯ The first item", 0, 17
"b∯ The second item", 18, 36
"c∯ The third list item", 37, 59

To get the actual spans of the original input from the preprocessed text, one would need to cumulatively subtract, for each sentence, the number of whitespace characters and also the number of identifiers that occurred before it.
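
The bookkeeping above can be sketched as follows. This is an illustration rather than pySBD's actual implementation: `original_spans` is a hypothetical helper, and it assumes that ❦ is the only inserted character and that other substitutions (such as ∯ for a period) preserve length, as in the example.

```python
import re
from typing import List, Tuple

MARK = "❦"  # sentence-start identifier used in the example above


def original_spans(marked: str) -> List[Tuple[int, int]]:
    """Map chunks of `marked` to offsets in the text without MARKs."""
    spans = []
    for m in re.finditer(f"[^{MARK}]+", marked):
        chunk = m.group()
        if not chunk.strip():
            continue  # skip chunks that are only whitespace
        # Each identifier before this chunk shifted it right by one.
        shift = marked[: m.start()].count(MARK)
        lead = len(chunk) - len(chunk.lstrip())
        trail = len(chunk) - len(chunk.rstrip())
        spans.append((m.start() - shift + lead, m.end() - shift - trail))
    return spans


marked = "❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item"
print(original_spans(marked))  # [(0, 17), (18, 36), (37, 59)]
```

Counting identifiers with `str.count` on every match is quadratic in the worst case; a running counter would avoid that at the cost of a slightly longer loop.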

dakinggg commented 4 years ago

I don't quite understand. Is the solution going to be at the level of the pysbd library, or at the level of downstream uses of the pysbd library?

nipunsadvilkar commented 4 years ago

It should be within pySBD, so that it returns the appropriate TextSpan objects.

dkarmon commented 4 years ago

Is there an update on when this issue will be resolved?