Open RMHogervorst opened 6 years ago
The Kingâ\u0080\u0099s Daughter and the Ape
should be
The King’s Daughter and the Ape.
This is the file that doesn't work (had to zip it, because github doesn't accept epub)
I extracted a few parts, and the HTML files within are encoded correctly; that is, each has a charset tag:
<meta charset="utf-8" />
So I guess it could read that tag, or default to UTF-8. In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:
if (!inherits(doc, "html_document")) doc <- xml2::read_html(doc)
read_html might need the encoding argument (defaults to "")
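One way to act on that idea is to sniff the declared charset from the markup and fall back to UTF-8 when none is found. The sketch below uses base R only; `detect_charset()` is a hypothetical helper, not part of pubcrawl or xml2, and the regular expression is a rough assumption that covers the common `<meta charset=...>` and `http-equiv` forms rather than a full HTML parser.

```r
# Hypothetical helper: return the charset declared in a <meta> tag,
# or `default` when no declaration is found.
detect_charset <- function(html, default = "UTF-8") {
  html <- paste(html, collapse = "\n")  # accept readLines() output too
  pattern <- "(?i)<meta[^>]*charset\\s*=\\s*[\"']?[A-Za-z0-9_-]+"
  hit <- regmatches(html, regexpr(pattern, html, perl = TRUE))
  if (length(hit) == 0) return(default)
  # Strip everything up to and including `charset=` and an optional quote.
  toupper(sub("(?is).*charset\\s*=\\s*[\"']?", "", hit, perl = TRUE))
}

detect_charset('<head><meta charset="utf-8" /></head>')      # "UTF-8"
detect_charset("<html><body>no declaration</body></html>")   # "UTF-8" (default)
```

The result could then be passed as the `encoding` argument of `xml2::read_html()`.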
If I read the html file in directly with rvest::html_text(xml2::read_html("file.html"))
it already defaults to UTF-8. So perhaps there is implicit recoding when xslt::xml_xslt is applied to the data?
Nope, that's not it (xml2::read_html(doc) would also always default to UTF-8).
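For what it is worth, the garbled title is exactly what appears when the three UTF-8 bytes of ’ (E2 80 99) are each decoded as a separate Latin-1 character. Below is a minimal base-R sketch of that mechanism, using only the title from this report (the variable names are illustrative). It reproduces the symptom but does not show where in the pipeline the misdecoding happens.

```r
original <- "The King\u2019s Daughter and the Ape"

# The UTF-8 byte sequence of the string; U+2019 becomes e2 80 99.
bytes <- charToRaw(enc2utf8(original))

# Decode one byte per character, as a Latin-1 reader would:
# byte value n maps to the code point with the same number.
garbled <- intToUtf8(as.integer(bytes))

identical(garbled, "The King\u00e2\u0080\u0099s Daughter and the Ape")  # TRUE
```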
So, the default was UTF-8, but I added a pass-through encoding parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. Vis a vis:
x <- epub_to_text("~/Downloads/b97b.epub", "Latin1")
z <- x$content[1] # just to make it easier to debug in my session
substr(z, 1, 1000) # I added the hard line breaks
[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated and annotated by Richard F. Burton; illustrated by Albert Letchford\n Contents\n Top\n\tEditorâ\u0080\u0099s Note to this Web
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translatorâ\u0080\u0099s Foreword.\n\tThe Book of The Thousand Nights and a
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykhâ\u0080\u0099s Story.\n\tThe Second Shaykhâ\u0080\u0099s Story.\n\tThe
Third Shaykhâ\u0080\u0099s Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and
his Falcon.\n\tThe Tale of the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled
Prince.\n\tThe Porter and the Three Ladies of Baghdad.\n\tThe First Kalandarâ\u0080\u0099s Tale.\n\tThe Second Kalandarâ\u0080\u0099s
Tale.\n\tThe Tale of the Envier and the Envied.\n\tThe Third Kalandarâ\u0080\u0099s Tale.\n\tThe Eldest Ladyâ\u0080\u0099s
Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of Nur Al-Din and his S"
In theory, it should have dealt with ^^ properly since it (honest!) passed it in all the way through and I even do a final iconv() to encoding on the column.
But, if you do (this text is Latin1 btw):
substr(iconv(z, "", to="Latin1"), 1, 1000)
[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated
and annotated by Richard F. Burton; illustrated by Albert Letchford\n Contents\n Top\n\tEditor’s Note to this Web
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translator’s Foreword.\n\tThe Book of The Thousand Nights and a
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykh’s Story.\n\tThe Second Shaykh’s Story.\n\tThe Third Shaykh’s
Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and his Falcon.\n\tThe Tale of
the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled Prince.\n\tThe Porter and the
Three Ladies of Baghdad.\n\tThe First Kalandar’s Tale.\n\tThe Second Kalandar’s Tale.\n\tThe Tale of the Envier and the
Envied.\n\tThe Third Kalandar’s Tale.\n\tThe Eldest Lady’s Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of
Nur Al-Din and his Son.\n\tThe Hunchback"
it works.
I'll keep this open since I'd like to provide robust support in the long run, but at least the iconv() should work ex post facto for the edge cases.
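That after-the-fact step can be wrapped in a small guard so it only touches strings that really are misdecoded UTF-8. The helper below is a hypothetical sketch, not part of pubcrawl: instead of calling `iconv()` it maps each character back to a single byte and keeps the result only if those bytes form valid UTF-8. It assumes the Latin-1 style garbling shown above (`â\u0080\u0099`); cp1252-style garbling (`â€™`) would need a different byte mapping.

```r
# Hypothetical helper: undo "UTF-8 bytes read as Latin-1" garbling.
fix_mojibake <- function(x) {
  repair_one <- function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(enc2utf8(s))
    # Invalid input, or characters above U+00FF, cannot have come
    # from single Latin-1 bytes, so leave the string alone.
    if (anyNA(cp) || any(cp > 255L)) return(s)
    candidate <- rawToChar(as.raw(cp))
    Encoding(candidate) <- "UTF-8"
    # Keep the repair only when the bytes are valid UTF-8.
    if (validUTF8(candidate)) candidate else s
  }
  vapply(x, repair_one, character(1), USE.NAMES = FALSE)
}

fix_mojibake(c("The King\u00e2\u0080\u0099s Daughter and the Ape", "caf\u00e9"))
```

In the session above this could be applied as `x$content <- fix_mojibake(x$content)`. Genuine Latin-1 text such as "café" is returned unchanged because its bytes do not form valid UTF-8.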
(just saw your extended comments)
aye, i even pass encoding along to it and ensure it's a raw vector when processing and still no-go.
something (IMO) "weird" is happening either as a result of read_html() OR in tibble-land causing some issues, but iconv() will work ex post facto.
It would be very nice if the text parsing would default to utf-8, because I have something that doesn't seem to be right. 1001 nights
should be