The biography, etc. in AuthorProfile files seem to want HTML entities for special characters, so there is some conversion to be done. At present you get nasty invalid-character boxes on the Kindle.
So in this case we want '&#8212;' (or '&#x2014;', the hex equivalent) HTML entity, rather than any unicode.
Side note: maybe there is a way to use unicode in these files, but the above works :)
See below - inf1 is the generated AuthorProfile file, inf2 is the HTML page from goodreads.com. Goodreads contains
>>> sys.argv[0]
'AuthorProfile.profile.B00Q0LB318_mobi.asc'
>>> sys.argv[1]
'Charles Dickens (Author of A Tale of Two Cities).htm'
>>> inf1 = "\n".join(open(sys.argv[0]).readlines())
>>> inf2 = "\n".join(open(sys.argv[1]).readlines())
Here is what is currently being generated - not sure what format this is. See the link below: it says that e.g. u'\u2014' is a valid unicode codepoint (codepoint == character in unicode, I think), while '\xe2\x80\x94' would be a valid utf-8 encoded bytestring. God only knows what '\u00e2\u0080\u0094' is - it looks like the three utf-8 bytes each escaped as if it were its own codepoint, so I suspect, but can't confirm, that this is just invalid output.
>>> inf1[1400:1500]
' been praised by fellow writers\\u00e2\\u0080\\u0094from Leo Tolstoy to George Orwell and G. K. Chester'
>>> inf1[1431:1449]
'\\u00e2\\u0080\\u0094'
>>>
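If the generated text really is the UTF-8 bytes escaped one at a time (an assumption based only on the sample above, where \u00e2, \u0080, \u0094 line up with the bytes e2 80 94), it could probably be repaired after the fact, along the lines of the Stack Overflow question linked below. A rough Python 3 sketch; `repair_escaped_utf8` is a made-up name, not something from the existing code, and it expects ASCII input containing literal `\uXXXX` escapes:

```python
def repair_escaped_utf8(text):
    r"""Turn text like 'writers\u00e2\u0080\u0094from' (literal backslashes)
    back into proper Unicode, assuming each escape is really one UTF-8 byte.
    """
    # 1. Interpret the literal \uXXXX escapes -> chars U+00E2, U+0080, U+0094
    unescaped = text.encode("ascii").decode("unicode_escape")
    # 2. Latin-1 maps code points 0-255 straight back to the bytes e2 80 94
    raw = unescaped.encode("latin-1")
    # 3. Those bytes are the real UTF-8 encoding of the original text
    return raw.decode("utf-8")


broken = "fellow writers\\u00e2\\u0080\\u0094from Leo Tolstoy"
fixed = repair_escaped_utf8(broken)
# Show the result as ASCII with numeric character references
print(fixed.encode("ascii", "xmlcharrefreplace").decode("ascii"))
# fellow writers&#8212;from Leo Tolstoy
```

If the file also contains genuine escapes above \u00ff, step 2 raises UnicodeEncodeError, which would at least tell you the "one escape per byte" guess is wrong for that file.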
Here is how to convert the original HTML to escaped HTML (with HTML entities) - obviously this can be run on larger strings.
>>> inf2[55000:55100]
'low writers\xe2\x80\x94from Leo Tolstoy to George Orwell and G. K. Chesterton\xe2\x80\x94for its realism, comedy, pros'
>>> inf2[55011:55014]
'\xe2\x80\x94'
>>> inf2[55011:55014].decode('utf-8')
u'\u2014'
>>> inf2[55011:55014].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'&#8212;'
>>> inf2[55000:55100].decode('utf-8').encode('ascii', "xmlcharrefreplace")
'low writers&#8212;from Leo Tolstoy to George Orwell and G. K. Chesterton&#8212;for its realism, comedy, pros'
>>>
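The REPL steps above could be wrapped in a small helper. This is only a sketch, written for Python 3 (where the raw page is `bytes` and the decode step is explicit); the name `to_html_entities` is hypothetical, and the `utf-8` default assumes the Goodreads page is UTF-8, as the sample suggests:

```python
def to_html_entities(raw, encoding="utf-8"):
    """Return an ASCII str in which every non-ASCII character in `raw`
    is replaced by a numeric character reference such as &#8212;.

    `raw` may be bytes (decoded with `encoding`) or an already-decoded str.
    """
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # xmlcharrefreplace swaps each character ASCII can't hold for &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


snippet = b"low writers\xe2\x80\x94from Leo Tolstoy"
print(to_html_entities(snippet))  # low writers&#8212;from Leo Tolstoy
```

Note that this only touches non-ASCII characters; any `&`, `<` or `>` already in the text passes through unchanged, which is what you want if the biography contains markup.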
Yeah, this was something that was really bugging me. I spent a lot of time trying to figure out how to fix it. I'll try what you mentioned. I didn't know about xmlcharrefreplace so hopefully it works.
http://stackoverflow.com/questions/5842115/converting-a-string-which-contains-both-utf-8-encoded-bytestrings-and-codepoints