szarroug3 / X-Ray_Calibre_Plugin

X-Ray Creator plugin for Calibre
http://www.mobileread.com/forums/showthread.php?t=273189
GNU General Public License v3.0

AuthorProfile files contain invalid characters #63

Closed stoduk closed 8 years ago

stoduk commented 8 years ago

The biography, etc. in AuthorProfile files seems to want HTML Entities for special characters, so there is some conversion to be done. At present you get nasty invalid character boxes on the Kindle.

So in this case we want the '&#8212;' HTML entity (or '&#x2014;', the hex equivalent), rather than any Unicode.

Side note: maybe there is a way to use unicode in these files, but the above works :)

See below - inf1 is the generated AuthorProfile file, inf2 is the HTML page from goodreads.com. Goodreads contains

>>> sys.argv[0]
'AuthorProfile.profile.B00Q0LB318_mobi.asc'
>>> sys.argv[1]
'Charles Dickens (Author of A Tale of Two Cities).htm'
>>> inf1 = "\n".join(open(sys.argv[0]).readlines())
>>> inf2 = "\n".join(open(sys.argv[1]).readlines())

Here is what is currently being generated; I'm not sure what format this is. See the link below: it says that e.g. u'\u2014' is a valid Unicode codepoint (codepoint == character in Unicode, I think), while '\xe2\x80\x94' would be a valid UTF-8 encoded bytestring. God only knows what '\u00e2\u0080\u0094' is. I suspect, but can't confirm, that this is just invalid output.

>>> inf1[1400:1500]
' been praised by fellow writers\\u00e2\\u0080\\u0094from Leo Tolstoy to George Orwell and G. K. Chester'
>>> inf1[1431:1449]
'\\u00e2\\u0080\\u0094'
>>> 
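
One plausible reading of that output, sketched in Python 3 (the transcripts in this issue are Python 2): the three UTF-8 bytes of the em dash get decoded one byte per character, and the result is then escaped by something JSON-like. Both the Latin-1 decode and the use of `json.dumps` are assumptions for illustration; the issue does not confirm how the plugin actually writes the file.

```python
import json

# The em dash as it appears in the Goodreads page: three UTF-8 bytes.
utf8_bytes = "\u2014".encode("utf-8")  # b'\xe2\x80\x94'

# Assumption: the bytes are decoded with a single-byte codec (Latin-1 here),
# so each byte becomes its own code point instead of one em dash.
misdecoded = utf8_bytes.decode("latin-1")

# Assumption: the text is then written through a JSON-style escaper.
# json.dumps escapes every non-ASCII code point by default (ensure_ascii=True).
escaped = json.dumps(misdecoded)
print(escaped)

# Reversing the two steps recovers the original character.
repaired = json.loads(escaped).encode("latin-1").decode("utf-8")
print(repaired == "\u2014")
```

If this is what happens, the sequence is not random garbage but the UTF-8 bytes of U+2014 treated as three separate characters, which would explain the invalid character boxes on the Kindle.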

Here is how to convert the original HTML to escaped HTML (with HTML entities) - obviously this can be run on larger strings.

>>> inf2[55000:55100]
'low writers\xe2\x80\x94from Leo Tolstoy to George Orwell and G. K. Chesterton\xe2\x80\x94for its realism, comedy, pros'
>>> inf2[55011:55014]
'\xe2\x80\x94'
>>> inf2[55011:55014].decode('utf-8')
u'\u2014'
>>> inf2[55011:55014].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'&#8212;'
>>> inf2[55000:55100].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'low writers&#8212;from Leo Tolstoy to George Orwell and G. K. Chesterton&#8212;for its realism, comedy, pros'
>>> 
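
The same conversion as a small Python 3 helper, as a rough sketch only. The function name `to_html_entities` and its parameters are hypothetical and not part of the plugin; it assumes the scraped page arrives as UTF-8 bytes.

```python
def to_html_entities(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw page bytes and replace non-ASCII characters with
    numeric HTML character references (hypothetical helper)."""
    text = raw.decode(encoding)
    # 'xmlcharrefreplace' turns each unencodable character into &#NNNN;
    ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
    return ascii_bytes.decode("ascii")


sample = b"fellow writers\xe2\x80\x94from Leo Tolstoy to George Orwell"
print(to_html_entities(sample))
```

The result is plain ASCII, so it should not matter which encoding the Kindle assumes when it reads the file.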

http://stackoverflow.com/questions/5842115/converting-a-string-which-contains-both-utf-8-encoded-bytestrings-and-codepoints

szarroug3 commented 8 years ago

Yeah, this was something that was really bugging me. I spent a lot of time trying to figure out how to fix it. I'll try what you mentioned. I didn't know about xmlcharrefreplace so hopefully it works.