Closed dakinggg closed 4 years ago
@danielkingai2 The character offset issue in #49 and #53 is due to \r not being handled in sentence_boundary_punctuation, so the correct TextSpan is not returned.
🚧WIP branch fix can be found here:
https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix
What I have been trying to do is add a ❦ identifier marking where each sentence should be split. Although the segmented sentences are correct, the offsets shift because of the added ❦ identifiers.
The next challenge is that spaCy's Doc.char_span returns None if the character offsets include the trailing whitespace: https://github.com/explosion/spaCy/issues/2637.
Example:
>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)
>>> text[0:17]
# 'a. The first item'
>>> text[0:18]
# 'a. The first item ' # Note whitespace here
>>> doc.char_span(0,17) # without whitespace offset
# 'a. The first item'
>>> doc.char_span(0,18) # with whitespace offset
# None
It would be nice if you could also think of a workaround or build on top of the WIP branch.
With respect to the whitespace thing, can't you just take the character offsets after trimming trailing whitespace or something? Like, if the text ends in whitespace, subtract one from the end offset.
Ah, I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase: is it possible to just check the text spans you are returning and, if they start or end with whitespace, adjust the start/end indices accordingly? Maybe it's OK if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
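The adjustment suggested above could look roughly like the sketch below. `trim_span` is a hypothetical helper name used only for illustration, not part of pySBD, and the raw spans are the whitespace-inclusive offsets for the example sentence from this thread.

```python
def trim_span(text, start, end):
    """Shrink [start, end) so it excludes leading/trailing whitespace.

    Hypothetical helper for illustration; not part of pySBD.
    """
    # Move the start forward past any leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Move the end backward past any trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = 'a. The first item b. The second item c. The third list item'
# Sentence spans that still include the trailing whitespace.
raw_spans = [(0, 18), (18, 37), (37, 59)]
trimmed = [trim_span(text, s, e) for s, e in raw_spans]
print(trimmed)  # [(0, 17), (18, 36), (37, 59)]
```

The trimmed offsets end on the last character of each sentence, which is the form the Doc.char_span example earlier in the thread accepts.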
Yes, a possible solution should account for the following:
Original input - a. The first item b. The second item c. The third list item
preprocessed - ❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item
non-strict pattern to get text within ❦ - r'[^❦]+'
"a∯ The first item ", 1, 19
"b∯ The second item ", 20, 39
"c∯ The third list item", 40, 62
Actual spans with whitespace:
"a∯ The first item ", 0, 18
"b∯ The second item ", 18, 37
"c∯ The third list item", 37, 59
spaCy char_span needs spans without trailing whitespace:
"a∯ The first item", 0, 17
"b∯ The second item", 18, 36
"c∯ The third list item", 37, 59
To recover the actual spans of the original input from the preprocessed text, one would need to successively subtract the number of whitespace characters (N) and the number of times ❦ occurred before each span.
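As a rough sketch of the ❦ part of that subtraction (an illustration under the assumptions of this example, not the code in the WIP branch): each match of r'[^❦]+' in the preprocessed text is shifted left by the number of ❦ identifiers that precede it. This assumes ❦ is the only inserted character and that ∯ replaces "." one-for-one, so it does not change offsets.

```python
import re

MARKER = '❦'  # split identifier inserted during preprocessing
preprocessed = '❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item'

original_spans = []
for match in re.finditer(r'[^❦]+', preprocessed):
    # Every marker before this match shifted the text right by one character.
    markers_before = preprocessed.count(MARKER, 0, match.start())
    original_spans.append(
        (match.start() - markers_before, match.end() - markers_before)
    )

print(original_spans)  # [(0, 18), (18, 37), (37, 59)]
```

These are the whitespace-inclusive spans listed above; leading or trailing whitespace would still have to be excluded separately before passing them to Doc.char_span.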
I don't quite understand, is the solution going to be at the level of the pysbd library? or the level of downstream uses of the pysbd library?
It should be within pySBD, so that it returns the appropriate TextSpan objects.
Is there an update on when this issue will be resolved?
Looks like something weird is happening in this case; note that the indices of the second text span are incorrect: