hrbrmstr / pubcrawl

🍺📖 Convert 'epub' Files to Text (Use https://github.com/ropensci/epubr instead)

encoding. of course it is encoding... #4

Open RMHogervorst opened 6 years ago

RMHogervorst commented 6 years ago

It would be very nice if the text parsing defaulted to UTF-8, because the output I get for this book (1001 Nights) doesn't look right. This title:

Generous Dealing of Yahya Son of KhÃ\u0081Lid with A Man Who Forged A Letter in His Name.

should be

Generous Dealing of Yahya Son of KhÁLid with A Man Who Forged A Letter in His Name.

RMHogervorst commented 6 years ago

The Kingâ\u0080\u0099s Daughter and the Ape

should be

The King’s Daughter and the Ape.
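
Both garbled titles match the pattern you would expect if UTF-8 bytes were decoded as Latin-1 somewhere along the way. A minimal base R sketch under that assumption follows; it only reproduces the symptom and says nothing about where in pubcrawl the misreading happens.

```r
# Assumption: the garbling comes from UTF-8 bytes being decoded as Latin-1.
good <- "The King\u2019s Daughter and the Ape"

# iconv() ignores the declared encoding of its input and treats the UTF-8
# bytes as ISO-8859-1, so every byte becomes a character of its own.
mangled <- iconv(good, from = "ISO-8859-1", to = "UTF-8")

# The apostrophe U+2019 is the three bytes e2 80 99 in UTF-8. Misread as
# Latin-1 they become U+00E2 (â), U+0080 and U+0099.
apostrophe <- iconv("\u2019", from = "ISO-8859-1", to = "UTF-8")
code_points <- utf8ToInt(apostrophe)
print(code_points)  # 226 128 153
```

The same arithmetic explains the first title: Á (U+00C1) is the bytes c3 81 in UTF-8, which read as Latin-1 give Ã followed by U+0081.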
RMHogervorst commented 6 years ago

This is the file that doesn't work (had to zip it, because github doesn't accept epub)

arab.zip

RMHogervorst commented 6 years ago

I extracted a few parts, and the HTML files inside are encoded correctly; that is, each one has a charset tag:

<meta charset="utf-8" />  

So I guess it could read that tag, or default to UTF-8. In https://github.com/hrbrmstr/pubcrawl/blob/master/R/clean-text.R#L5:

if (!inherits(doc, "html_document")) doc <- xml2::read_html(doc)

read_html might need the encoding argument (it defaults to ""). If I read the HTML file in directly with rvest::html_text(xml2::read_html("file.html")), it already defaults to UTF-8. So perhaps there is implicit recoding when xslt::xml_xslt is applied to the data?
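
The "read that tag, or default to UTF-8" idea could look roughly like this in base R. The helper name detect_charset() and its arguments are hypothetical and exist in neither pubcrawl nor xml2; it is a sketch that only scans the first few lines for a charset declaration.

```r
# Hypothetical helper (not part of pubcrawl or xml2): look for a charset
# declaration near the top of an HTML file and fall back to a default.
detect_charset <- function(path, default = "UTF-8", n = 50L) {
  top <- readLines(path, n = n, warn = FALSE)
  # Matches e.g. charset="utf-8" or charset=ISO-8859-1, case-insensitively.
  pattern <- "(?i)charset\\s*=\\s*[\"']?[a-z0-9_-]+"
  hits <- regmatches(top, regexpr(pattern, top, perl = TRUE, useBytes = TRUE))
  if (length(hits) == 0L) return(default)
  # Strip the leading 'charset=' and optional quote, keeping only the name.
  sub("(?i)^charset\\s*=\\s*[\"']?", "", hits[1], perl = TRUE)
}

# Try it on a throwaway file that mimics the tag quoted above.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html><head>",
             "<meta charset=\"utf-8\" />",
             "</head><body></body></html>"), tmp)
charset <- detect_charset(tmp)

# A file without any declaration falls back to the default.
tmp_plain <- tempfile(fileext = ".html")
writeLines("<html><body>no declaration</body></html>", tmp_plain)
fallback <- detect_charset(tmp_plain)
```

The detected name could then be handed to the encoding argument mentioned above, for instance xml2::read_html(path, encoding = detect_charset(path)).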

RMHogervorst commented 6 years ago

Nope, that's not it (xml2::read_html(doc) would also always default to UTF-8).

hrbrmstr commented 6 years ago

So, the default was UTF-8, but I added a pass-through encoding parameter wherever I could, and it still looks as though you're going to have to post-process to handle Latin1 or cp1252 (etc.) encodings. For example:

x <- epub_to_text("~/Downloads/b97b.epub", "Latin1")

z <- x$content[1] # just to make it easier to debug in my session

substr(z, 1, 1000) # I added the hard line breaks

[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditorâ\u0080\u0099s Note to this Web 
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translatorâ\u0080\u0099s Foreword.\n\tThe Book of The Thousand Nights and a 
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykhâ\u0080\u0099s Story.\n\tThe Second Shaykhâ\u0080\u0099s Story.\n\tThe 
Third Shaykhâ\u0080\u0099s Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and 
his Falcon.\n\tThe Tale of the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled 
Prince.\n\tThe Porter and the Three Ladies of Baghdad.\n\tThe First Kalandarâ\u0080\u0099s Tale.\n\tThe Second Kalandarâ\u0080\u0099s 
Tale.\n\tThe Tale of the Envier and the Envied.\n\tThe Third Kalandarâ\u0080\u0099s Tale.\n\tThe Eldest Ladyâ\u0080\u0099s 
Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of Nur Al-Din and his S"

In theory, it should have dealt with ^^ properly, since I (honest!) passed the encoding in all the way through, and I even do a final iconv() to that encoding on the column.

But, if you do (this text is Latin1 btw):

substr(iconv(z, "", to="Latin1"), 1, 1000)

[1] "The Book of The Thousand Nights and a Night: a plain and literal translation of the Arabian Nights Entertainments. Translated 
and annotated by Richard F. Burton; illustrated by Albert Letchford\n    Contents\n      Top\n\tEditor’s Note to this Web 
Edition\n\tDedications to the Original Ten Volumes\n\tThe Translator’s Foreword.\n\tThe Book of The Thousand Nights and a 
Night\n\tTale of the Trader and the Jinni.\n\tThe First Shaykh’s Story.\n\tThe Second Shaykh’s Story.\n\tThe Third Shaykh’s 
Story.\n\tThe Fisherman and the Jinni.\n\tThe Tale of the Wazir and the Sage Duban.\n\tKing Sindibad and his Falcon.\n\tThe Tale of 
the Husband and the Parrot.\n\tThe Tale of the Prince and the Ogress.\n\tThe Tale of the Ensorcelled Prince.\n\tThe Porter and the 
Three Ladies of Baghdad.\n\tThe First Kalandar’s Tale.\n\tThe Second Kalandar’s Tale.\n\tThe Tale of the Envier and the 
Envied.\n\tThe Third Kalandar’s Tale.\n\tThe Eldest Lady’s Tale.\n\tTale of the Portress.\n\tThe Tale of the Three Apples\n\tTale of 
Nur Al-Din and his Son.\n\tThe Hunchback"

it works.

I'll keep this open since I'd like to provide robust support in the long run, but at least the iconv() should work ex post facto for the edge cases.

hrbrmstr commented 6 years ago

(just saw your extended comments)

aye, i even pass encoding along to it and ensure it's a raw vector when processing and still no-go.

something (IMO) "weird" is happening either as a result of read_html() OR in tibble-land causing some issues but iconv() will work ex post facto.
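
For anyone applying the iconv() workaround to a whole column, here is a minimal sketch of a defensive version. The helper name repair_mojibake() is hypothetical, and it spells out the encodings explicitly (UTF-8 to ISO-8859-1) rather than using the "" native-encoding shorthand from the call above, so it is a variation on that workaround and not the package's code. It assumes the damage is "UTF-8 bytes read as Latin-1" and leaves an element untouched when the round trip does not yield valid UTF-8.

```r
# Hypothetical helper: undo "UTF-8 read as Latin-1" damage where possible,
# leaving elements alone when the round trip fails.
repair_mojibake <- function(x) {
  x <- enc2utf8(x)
  # Map each character back to the single byte it came from (NA if a
  # character has no Latin-1 equivalent, e.g. an already-correct U+2019).
  bytes <- iconv(x, from = "UTF-8", to = "ISO-8859-1")
  candidate <- bytes
  Encoding(candidate) <- "UTF-8"  # reinterpret those bytes as UTF-8
  ok <- !is.na(bytes) & validUTF8(candidate)
  x[ok] <- candidate[ok]
  x
}

titles <- c(
  # A deliberately mangled title, built the same way the bug appears to.
  iconv("The King\u2019s Daughter and the Ape",
        from = "ISO-8859-1", to = "UTF-8"),
  "Tale of the Portress",  # plain ASCII, should pass through unchanged
  "caf\u00e9"              # already correct, should be left alone
)
fixed <- repair_mojibake(titles)
```

One caveat: text that legitimately contains a sequence such as "Ã©" would also be rewritten, so this is only safe on content already known to be affected.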