AuthorProfile files contain invalid characters

The biography, etc. in AuthorProfile files seem to want HTML entities for special characters, so some conversion is needed. At present you get nasty invalid-character boxes on the Kindle.

So in this case we want the '&#8212;' HTML entity (or '&#x2014;', the hex equivalent) rather than any raw Unicode character.
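As a quick sanity check that the decimal and hex forms really name the same character, here is a small sketch. It is written for Python 3 (the sessions in this report are Python 2), it uses the standard-library `html.unescape`, and the variable names are only for illustration.

```python
import html

# Decimal, hex, and named character references for the em dash (U+2014).
decimal_ref = "&#8212;"
hex_ref = "&#x2014;"
named_ref = "&mdash;"

# Decode each reference back to the character it stands for.
decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, named_ref)]

# All three should decode to the same single code point.
all_em_dash = all(ch == "\u2014" for ch in decoded)
print(all_em_dash)
```

Whether the Kindle accepts the named form as well is not something this report has tested; only the numeric form is known to work here.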

Side note: maybe there is a way to use unicode in these files, but the above works :)

See below - inf1 is the generated AuthorProfile file, inf2 is the HTML page from goodreads.com. Goodreads contains

>>> sys.argv[0]
'AuthorProfile.profile.B00Q0LB318_mobi.asc'
>>> sys.argv[1]
'Charles Dickens (Author of A Tale of Two Cities).htm'
>>> inf1 = "\n".join(open(sys.argv[0]).readlines())
>>> inf2 = "\n".join(open(sys.argv[1]).readlines())

Here is what is currently being generated. I'm not sure what format this is. See the link below: it claims that e.g. u'\u2014' is a valid Unicode code point (code point == character in Unicode, I think), while '\xe2\x80\x94' would be a valid UTF-8 encoded bytestring. God only knows what '\u00e2\u0080\u0094' is; I suspect, but can't confirm, that this is just invalid output.

>>> inf1[1400:1500]
' been praised by fellow writers\\u00e2\\u0080\\u0094from Leo Tolstoy to George Orwell and G. K. Chester'
>>> inf1[1431:1449]
'\\u00e2\\u0080\\u0094'
>>>
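One plausible explanation, offered as a guess rather than a diagnosis of the plugin: the three escapes are exactly what you get when each UTF-8 byte of the em dash is treated as its own code point and then escaped. The sketch below (Python 3) reproduces that pattern and reverses it. It assumes the file uses JSON-style `\u` escapes, and `mojibake_escape` and `repair` are hypothetical helper names, not functions from the plugin.

```python
import json


def mojibake_escape(text):
    """Hypothetical helper: reproduce the suspected byte-per-code-point bug."""
    # Encode to UTF-8, then misread every byte as one Latin-1 character.
    misdecoded = text.encode("utf-8").decode("latin-1")
    # JSON-escape the result; ensure_ascii=True is the json.dumps default.
    return json.dumps(misdecoded)


def repair(misdecoded):
    """Hypothetical helper: turn the misread characters back into UTF-8 text."""
    return misdecoded.encode("latin-1").decode("utf-8")


broken = mojibake_escape("\u2014")
print(broken)  # "\u00e2\u0080\u0094"

# Undo the damage: parse the escapes, then re-decode the bytes as UTF-8.
fixed = repair(json.loads(broken))
print(fixed == "\u2014")
```

If this is what is happening, the fix belongs wherever the Goodreads page is first read: decode the bytes as UTF-8 once, before anything is escaped or written out.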

Here is how to convert the original HTML to escaped HTML (with HTML entities) - obviously this can be run on larger strings.

>>> inf2[55000:55100]
'low writers\xe2\x80\x94from Leo Tolstoy to George Orwell and G. K. Chesterton\xe2\x80\x94for its realism, comedy, pros'
>>> inf2[55011:55014]
'\xe2\x80\x94'
>>> inf2[55011:55014].decode('utf-8')
u'\u2014'
>>> inf2[55011:55014].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'&#8212;'
>>> inf2[55000:55100].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'low writers&#8212;from Leo Tolstoy to George Orwell and G. K. Chesterton&#8212;for its realism, comedy, pros'
>>>
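For anyone on Python 3, where `str` has no `.decode`, the same conversion can be sketched as below. `to_html_entities` is a hypothetical helper name, and the sample bytes are a shortened copy of the Goodreads excerpt above.

```python
def to_html_entities(text):
    """Hypothetical helper: replace non-ASCII characters with numeric character references."""
    # xmlcharrefreplace turns each unencodable character into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Bytes as read from the saved Goodreads page (UTF-8 encoded).
raw = b"low writers\xe2\x80\x94from Leo Tolstoy"

# Decode once to text, then escape everything outside ASCII.
escaped = to_html_entities(raw.decode("utf-8"))
print(escaped)  # low writers&#8212;from Leo Tolstoy
```

Note that this only touches non-ASCII characters; `&`, `<` and `>` pass through unchanged, so markup already present in the biography is preserved.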

http://stackoverflow.com/questions/5842115/converting-a-string-which-contains-both-utf-8-encoded-bytestrings-and-codepoints

szarroug3 / X-Ray_Calibre_Plugin
