quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Encoding handling not handled by stringi and possibly inconsistent #37

Open kbenoit opened 7 years ago

kbenoit commented 7 years ago

Our README states:

(All encoding functions are handled by the stringi package.)

But this is hardly true: in get-functions.R we rely on the base iconv() conversion that happens through file(), not on stringi.

We should go through carefully to ensure consistency, and also change our claims to be accurate.
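For reference, a minimal sketch of the base R route described above, showing that the conversion done through file() matches an explicit iconv() call and never touches stringi. The temporary file and the ISO-8859-1 encoding are assumptions chosen for illustration, not taken from get-functions.R.

```r
# Hypothetical test file: the bytes of "café" in ISO-8859-1 (not valid UTF-8)
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9)), path)

# Route 1: let the connection re-encode the stream while it is read.
# The result is converted to the session's native encoding.
con <- file(path, encoding = "ISO-8859-1")
via_file <- readLines(con, warn = FALSE)
close(con)

# Route 2: read the bytes untouched, then convert explicitly with base iconv()
raw_line <- readLines(path, warn = FALSE)
via_iconv <- iconv(raw_line, from = "ISO-8859-1", to = "UTF-8")

print(via_file)
print(via_iconv)
```

In a UTF-8 locale both routes should give the same string; in other locales the file() route depends on the native encoding, which is part of the consistency concern.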

adamobeng commented 7 years ago

By my reckoning, we could replace all of the readLines calls with stri_read_lines (although that function is labelled experimental). Presumably jsonlite and XML know how to deal with their own encodings, which leaves html and doc. XML::htmlTreeParse has an encoding option, but I don't think stringi is designed to autodetect the encoding of marked-up text. I'm not sure what to do with antiword: it doesn't look like you can specify an output encoding, which means the result might be platform-dependent...
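A rough sketch of what swapping readLines for stri_read_lines might look like. The helper name read_lines_stringi and the test file are hypothetical, and since stri_read_lines is labelled experimental its arguments may differ between stringi versions; only the file path and the encoding argument are used here.

```r
library(stringi)

# Hypothetical helper (not part of readtext): read a text file with stringi
# and return its lines converted to UTF-8.
read_lines_stringi <- function(path, encoding = "UTF-8") {
  stri_read_lines(path, encoding = encoding)
}

# Hypothetical test file: "café" and "naïve" in ISO-8859-1, one per line
path <- tempfile(fileext = ".txt")
writeBin(
  as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a,
           0x6e, 0x61, 0xef, 0x76, 0x65, 0x0a)),
  path
)

lines <- read_lines_stringi(path, encoding = "ISO-8859-1")
print(lines)
```

Unlike the file() route, the output here does not depend on the session's native encoding, because stringi converts to UTF-8 directly.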

adamobeng commented 7 years ago

I should also note that we don't currently "include functions for diagnosing encodings on a file-by-file basis", because the stringi encoding detection stuff is not currently exposed.
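As an illustration of what exposing that detection could look like, here is a sketch built on stringi's stri_enc_detect. The helper name detect_file_encoding and its markup argument are assumptions, not existing readtext functions; markup is simply passed to stri_enc_detect's filter_angle_brackets argument, which is meant for HTML/XML input.

```r
library(stringi)

# Hypothetical helper (not part of readtext): guess the encoding of one file.
# Returns stringi's candidate encodings for the file's raw bytes.
detect_file_encoding <- function(path, markup = FALSE) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  stri_enc_detect(bytes, filter_angle_brackets = markup)[[1]]
}

# Hypothetical test file containing a UTF-8 encoded French sentence
path <- tempfile(fileext = ".txt")
text <- "Les \u00e9l\u00e8ves ont \u00e9t\u00e9 re\u00e7us \u00e0 l'\u00e9cole."
writeLines(enc2utf8(text), path, useBytes = TRUE)

guess <- detect_file_encoding(path)
print(guess)
```

The guesses are heuristic and tend to be unreliable for short files, so a diagnostic like this would probably be advisory rather than something to apply automatically.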

kbenoit commented 7 years ago

I'm putting this on the long list for the next release.