Parse issues when unicode is right next to matching entity

stoduk commented 8 years ago

If an entity is mentioned with unicode immediately before or after the term, the parser matches too much (the unicode and sometimes some other characters). This is a rare case in the books I've looked at, but it may be more common in foreign-language books.

1) This match includes the three nbsp before "Bob". Helpfully, the Kindle copes OK with this (though it highlights the nbsp when you hover, so it looks a bit odd). I mostly only care because my Xray tester barfs on this error.

"Bob Cratchit" is 12 characters long, yet we are matching 19 bytes (three 2-byte unicode characters plus a comma at the end). Not sure if we meant to match the comma either.

>>> rawml_contents[5279:5298]
'\xc2\xa0\xc2\xa0\xc2\xa0Bob Cratchit,'

2) Here the unicode is at the end - an EM-DASH (a 3-byte character).

"Tiny Tim" is 8 characters long, yet we are matching 16 bytes (the 3-byte unicode character plus the word "shall" after it!)

>>> rawml_contents[192224:192240]
'Tiny Tim\xe2\x80\x94shall'
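The byte counts in both cases can be double-checked with a minimal Python 3 sketch. The session above looks like Python 2, where a plain string is already bytes; the variable names here are only for illustration.

```python
# Rebuild the two matched spans from the report and count their UTF-8 bytes.
NBSP = "\u00a0"      # NO-BREAK SPACE, 2 bytes in UTF-8 (\xc2\xa0)
EM_DASH = "\u2014"   # EM DASH, 3 bytes in UTF-8 (\xe2\x80\x94)

first_span = (NBSP * 3 + "Bob Cratchit,").encode("utf-8")
second_span = ("Tiny Tim" + EM_DASH + "shall").encode("utf-8")

print(len("Bob Cratchit"), len(first_span))   # 12 19
print(len("Tiny Tim"), len(second_span))      # 8 16
```

So the extra 7 bytes in the first match are three nbsp (6 bytes) plus the comma, and the extra 8 bytes in the second are the em dash (3 bytes) plus "shall" (5 bytes).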

szarroug3 commented 8 years ago

I actually did this on purpose. 'Tiny Tim\xe2\x80\x94shall' converts to Tiny Tim-shall. When you try to highlight Tim here on the Kindle, it actually highlights Tim-shall. I remember because I once tried to highlight a character to get X-Ray to pop up, and it wouldn't, because it was highlighting the whole thing, including the - and the word after it.

Anyway, if we do want to change this, the regex works correctly. It's the code that searches for the end of the phrase that's messing this up. I've made it so that in the case of "Tiny Tim's", the highlight includes the 's at the end, because when you highlight that on the Kindle, it automatically highlights the 's. We should be able to just modify the functions (_find_len_word and _find_start in lib\book_parser.py) to stop searching if they encounter a unicode character. Right now, the code stops searching when it finds a space, which I believe is what the Kindle uses when it's trying to find the beginning and end of something you're trying to highlight.
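To make the two behaviours concrete, here is a rough Python 3 sketch of the boundary search. It is not the plugin's actual code: `expand_match`, `_is_boundary` and the `stop_at_non_ascii` flag are hypothetical names, and it assumes the only boundary today is an ASCII space (or the ends of the text).

```python
from typing import Tuple


def _is_boundary(byte: int, stop_at_non_ascii: bool) -> bool:
    # An ASCII space always ends the search; optionally any non-ASCII byte does too.
    return byte == 0x20 or (stop_at_non_ascii and byte >= 0x80)


def expand_match(text: bytes, match_start: int, match_len: int,
                 stop_at_non_ascii: bool = False) -> Tuple[int, int]:
    """Grow a regex match outwards to the nearest boundary; return (start, length)."""
    start = match_start
    while start > 0 and not _is_boundary(text[start - 1], stop_at_non_ascii):
        start -= 1
    end = match_start + match_len
    while end < len(text) and not _is_boundary(text[end], stop_at_non_ascii):
        end += 1
    return start, end - start


text = "He said Tiny Tim\u2014shall live.".encode("utf-8")
at = text.index(b"Tiny Tim")
print(expand_match(text, at, 8))                          # (8, 16)
print(expand_match(text, at, 8, stop_at_non_ascii=True))  # (8, 8)
```

With the flag off, the em dash and "shall" are swallowed, which is what the Kindle highlights. With it on, the match stops at the first non-ASCII byte but would still pick up a trailing 's or comma.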

_find_start searches for the first character in the phrase. _find_len_word searches for the end of the phrase. Here are some examples:

"Tiny Tim had a good day." -- start should be the opening " and end should be the m in Tim
"blah blah blah Tiny Tim" -- start should be the T in Tiny and the end should be the closing "
Tiny Tim's -- start should be the T in Tiny and end should be s at the end
Tiny Tim, Jane Doe, Jake Doe -- start should be T in Tiny and end should be the , after Tim (kindle highlights commas as part of the word)

On Fri, May 20, 2016 at 6:29 AM Anthony Toole notifications@github.com wrote:

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/23

stoduk commented 8 years ago

I just tested this, and you are bang on - seems we have to make the entities match whatever Kindle chooses to highlight (including both the nbsp in the first case, and the -shall in the second). How perverse!

I guess my testing script needs fixing, not the plugin. I wonder if the Windows GUI does this - I don't remember coming across it.

anthony$ sqlite3 /Volumes/Kindle/documents/Dickens\,\ Charles/Christmas\ Carol\,\ A\ -\ Charles\ Dickens.sdr/XRAY.entities.B00Q0LB318_mobi.asc 
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> select * from occurrence where start=192224;
8|192224|16
sqlite> select * from entity where id=8;
8|Tiny Tim||1|23|1
sqlite> update occurrence set length=8 where start=192224;
sqlite> select * from occurrence where start=192224;
8|192224|8
sqlite> select * from occurrence where start=5279;
3|5279|19
sqlite> select * from entity where id=3;
3|Bob Cratchit||1|15|1
sqlite> update occurrence set start=5285, length=12 where start=5279;
sqlite> select * from occurrence where start=5279;
sqlite> select * from occurrence where start=5285;
3|5285|12
sqlite> ^D
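Since the fix belongs in the tester, a check along these lines might be closer to what the Kindle expects: accept an occurrence when the entity's name appears somewhere inside the span, instead of requiring an exact match. This is only a sketch in Python 3. The function name is made up, columns are read by position because only `start`, `length` and `id` are named in the session above, and the in-memory tables are a hypothetical stand-in for the real XRAY.entities file.

```python
import sqlite3
from typing import List, Tuple


def find_suspect_occurrences(conn: sqlite3.Connection,
                             rawml: bytes) -> List[Tuple[int, int, int]]:
    """Return occurrence rows whose byte span does not contain the entity's label."""
    # Assumed layout: entity rows begin (id, label, ...);
    # occurrence rows are (entity id, start, length).
    labels = {row[0]: row[1] for row in conn.execute("SELECT * FROM entity")}
    suspects = []
    for entity_id, start, length in conn.execute("SELECT * FROM occurrence"):
        span = rawml[start:start + length]
        if labels[entity_id].encode("utf-8") not in span:
            suspects.append((entity_id, start, length))
    return suspects


# Tiny in-memory stand-in for the real database (hypothetical schema and data).
rawml = "He said Tiny Tim\u2014shall live.".encode("utf-8")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id INTEGER, label TEXT)")
conn.execute("CREATE TABLE occurrence (entity INTEGER, start INTEGER, length INTEGER)")
conn.execute("INSERT INTO entity VALUES (8, 'Tiny Tim')")
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", [(8, 8, 16), (8, 0, 7)])
print(find_suspect_occurrences(conn, rawml))   # [(8, 0, 7)]
```

One caveat: an occurrence matched through an alias rather than the main label would be flagged here too, so a real tester would probably need the alias list as well.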

szarroug3 commented 8 years ago

The Windows GUI doesn't do this correctly - another reason I don't like using it. It doesn't even handle 's correctly, which frustrated me to no end haha

On Sat, May 21, 2016, 7:36 AM Anthony Toole notifications@github.com wrote:

Closed #23 https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/23.
