Incorrect text span start and end returned

dakinggg commented 5 years ago

Looks like something weird is happening in this case. Note that the indices of the second text span are incorrect:

>>> import pysbd
>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)]
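
One way to spot this kind of bug is to check that each span's offsets slice the original text back to its sentence. This is a minimal sketch, not pysbd code: `TextSpan` here is a stand-in named tuple with the same fields as the output above, and `find_bad_spans` is a hypothetical helper.

```python
from typing import List, NamedTuple


class TextSpan(NamedTuple):
    # Stand-in with the fields shown in the output above; not pysbd's own class.
    sent: str
    start: int
    end: int


def find_bad_spans(text: str, spans: List[TextSpan]) -> List[TextSpan]:
    """Return the spans whose offsets do not slice `text` back to `sent`."""
    return [s for s in spans if text[s.start:s.end] != s.sent]


text = "1) The first item. 2) The second item."
spans = [
    TextSpan("1) The first item.", 0, 18),
    TextSpan("2) The second item.", 0, 19),  # offsets as reported in this issue
]
bad = find_bad_spans(text, spans)  # only the second span fails the check
```

For this input the second sentence actually sits at `text[19:38]`, so a span with those offsets would pass the check.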

nipunsadvilkar commented 5 years ago

@danielkingai2 The character offset issue in #49 and #53 is due to \r not being handled in sentence_boundary_punctuation, so the correct TextSpan is not returned.

🚧WIP branch fix can be found here:

https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix

https://github.com/nipunsadvilkar/pySBD/blob/995af9e3c3d324aa053cb06f79c0038c882f633d/pysbd/processor.py#L81

What I have been trying to do is add a ❦ identifier at each point where a sentence should be split. Though the segmented sentences are correct, the offsets shift because of the added ❦ identifiers. The next challenge is that spaCy's Doc.char_span returns None if the character offsets include the whitespace - https://github.com/explosion/spaCy/issues/2637. Example:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)

>>> text[0:17]
# 'a. The first item'

>>> text[0:18]
# 'a. The first item ' # Note whitespace here

>>> doc.char_span(0,17) # without whitespace offset
# 'a. The first item'

>>> doc.char_span(0,18) # with whitespace offset
# None

It would be nice if you could also think of a workaround or build on top of the WIP branch.

dakinggg commented 5 years ago

With respect to the whitespace thing, can't you just take the character offsets after trimming trailing whitespace or something? Like, if the text ends in whitespace, subtract one from the end offset.

dakinggg commented 5 years ago

Ah, I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase, is it possible to just check the text spans you are returning, and if they start/end with whitespace, edit the start/end indices appropriately? Maybe it's OK if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
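
The adjustment suggested here could look roughly like the following. It is only a sketch of the idea, and `trim_span` is a hypothetical helper, not part of pysbd or spaCy.

```python
from typing import Tuple


def trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink the half-open range [start, end) so it excludes
    leading and trailing whitespace in `text`."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = 'a. The first item b. The second item c. The third list item'
first = trim_span(text, 0, 18)    # (0, 17)
second = trim_span(text, 18, 37)  # (18, 36)
```

The trimmed offsets end on token boundaries, which is what Doc.char_span needed in the example above.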

nipunsadvilkar commented 5 years ago

Yes, a possible solution should account for the following:

Original input - a. The first item b. The second item c. The third list item

Preprocessed - ❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item

Non-strict pattern to get the text between ❦ markers - r'[^❦]+', which gives these spans in the preprocessed text:

"a∯ The first item ", 1, 19
"b∯ The second item ", 20, 39
"c∯ The third list item", 40, 62

Actual spans with whitespace:

"a∯ The first item ", 0, 18
"b∯ The second item ", 18, 37
"c∯ The third list item", 37, 59

Spans needed for spaCy char_span, without trailing whitespace:

"a∯ The first item", 0, 17
"b∯ The second item", 18, 36
"c∯ The third list item", 37, 59

To get the actual spans of the original input out of the preprocessed text, one would need to cumulatively subtract the number of whitespace characters and also the number of times ❦ occurred before each sentence.
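
The bookkeeping described above can be sketched as follows. This is not the pySBD implementation: `original_spans` is a hypothetical helper, and it assumes ❦ is the only character inserted during preprocessing, with every other substitution (such as ∯ for the period) being one-for-one.

```python
import re
from typing import List, Tuple

MARKER = "❦"


def original_spans(preprocessed: str) -> List[Tuple[str, int, int]]:
    """Map chunks between markers back to offsets in the original input,
    excluding leading and trailing whitespace from each span."""
    spans = []
    for m in re.finditer(f"[^{MARKER}]+", preprocessed):
        # Each marker before this chunk shifted it right by one character.
        shift = preprocessed.count(MARKER, 0, m.start())
        chunk = m.group()
        stripped = chunk.strip()
        if not stripped:
            continue  # skip whitespace-only chunks
        lead = len(chunk) - len(chunk.lstrip())
        start = m.start() - shift + lead
        spans.append((stripped, start, start + len(stripped)))
    return spans


pre = "❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item"
result = original_spans(pre)
```

On the preprocessed example from this comment, the offsets come out as (0, 17), (18, 36) and (37, 59), matching the spans listed for spaCy char_span.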

dakinggg commented 5 years ago

I don't quite understand. Is the solution going to be at the level of the pysbd library, or at the level of downstream uses of the pysbd library?

nipunsadvilkar commented 5 years ago

It should be within pySBD, so that it returns appropriate TextSpan objects.

dkarmon commented 4 years ago

Is there an update on when this issue will be resolved?
