quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Fixed bug in get_csv #151

Closed JBGruber closed 5 years ago

JBGruber commented 5 years ago

readtext ignores the encoding argument when reading a csv file (see SO). The problem is that in get_csv, the encodings "Latin-1" and "UTF-8" are never used, because the encoding argument is missing from the call to data.table::fread. As far as I can see, this is not a problem on Linux and Mac, since these operating systems have sensible defaults.

This PR fixes the problem by passing the missing encoding argument to data.table::fread.
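
For illustration, the idea of the fix can be sketched roughly as follows. This is a minimal sketch and not the actual readtext source: get_csv_sketch, its arguments, and the fallback to "unknown" are hypothetical and only stand in for the internal get_csv. The one thing it relies on is that data.table::fread has an encoding argument that accepts "unknown", "UTF-8" or "Latin-1".

```r
# Hypothetical, simplified stand-in for readtext's internal get_csv();
# the real function's signature and body may differ.
get_csv_sketch <- function(path, text_field, encoding = "unknown", ...) {
  # fread() only accepts "unknown", "UTF-8" or "Latin-1" for `encoding`,
  # so anything else falls back to "unknown" in this sketch (an assumption).
  enc <- if (encoding %in% c("UTF-8", "Latin-1")) encoding else "unknown"
  docs <- data.table::fread(
    input = path,
    data.table = FALSE,
    stringsAsFactors = FALSE,
    encoding = enc,  # forward the encoding instead of dropping it
    ...
  )
  # Rename the chosen text column to "text", the name used in the output below.
  names(docs)[names(docs) == text_field] <- "text"
  docs
}

# Usage: write a small UTF-8 csv to a temporary file and read it back.
expected <- c("M\u00fcnchen", "La\u00efrie", "M\u00e1nos")
df_sketch <- data.frame(c_text = expected, c_id = c("aa", "bb", "cc"),
                        stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")
write.csv(df_sketch, tmp, row.names = FALSE, fileEncoding = "UTF-8")
res <- get_csv_sketch(tmp, text_field = "c_text", encoding = "UTF-8")
```

With the encoding forwarded, the strings read by fread are marked as UTF-8 rather than being interpreted in the native encoding, which is where the Windows-only problem comes from.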

Here is some code to reproduce the addressed problem (will only cause issues on Windows):

df <- structure(list(c_text = structure(c(3L, 1L, 2L), .Label = c("Laïrie", 
                         "Mános", "München"), class = "factor"), c_id = structure(1:3, .Label = c("aa", 
                         "bb", "cc"), class = "factor")), class = "data.frame", row.names = c(NA, 
                         -3L))
write.csv(df,
          "~/test.csv",
          row.names = FALSE,
          fileEncoding = "UTF-8")
text_raw <- readtext::readtext("~/test.csv",
                               encoding = "UTF-8",
                               text_field = "c_text")

text_raw

Before:

text_raw
#> readtext object consisting of 3 documents and 1 docvar.
#> # Description: data.frame [3 x 3]
#>   doc_id     text              c_id 
#>   <chr>      <chr>             <chr>
#> 1 test.csv.1 "\"München\"..." aa   
#> 2 test.csv.2 "\"Laïrie\"..."  bb   
#> 3 test.csv.3 "\"Mános\"..."   cc

After:

text_raw
#> readtext object consisting of 3 documents and 1 docvar.
#> # Description: data.frame [3 x 3]
#>   doc_id     text             c_id 
#>   <chr>      <chr>            <chr>
#> 1 test.csv.1 "\"München\"..." aa   
#> 2 test.csv.2 "\"Laïrie\"..."  bb   
#> 3 test.csv.3 "\"Mános\"..."   cc

Created on 2019-05-01 by the reprex package (v0.2.1)

kbenoit commented 5 years ago

Thanks @JBGruber !