ropensci / bibtex

bibtex parser for R
https://docs.ropensci.org/bibtex/

Encoding error when parsing BibTeX file with multi-byte characters on Windows #20

Open hongyuanjia opened 6 years ago

hongyuanjia commented 6 years ago

Thanks for this great package. I encountered a problem when using the bibtex package to parse BibTeX files containing Chinese characters on Windows:

# Get current locale info
Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# Set locale to Chinese
Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <- "
    @misc{text,
        title = {{你好}},
        language = {zh-CN},
        author = {{你好}},
        month = jun,
        year = {2013},
        pages = {163}
    }
"
# change encoding to "UTF-8"
bib_text_utf8 <- enc2utf8(bib_text)
Encoding(bib_text_utf8)
#> [1] "UTF-8"

# make sure the saved BibTeX file is UTF-8 encoded
con <- file("test.bib", encoding = "UTF-8")
writeLines(bib_text_utf8, con)
close(con)

readLines("test.bib", encoding = "UTF-8")
#> [1] ""                                "        @misc{text,"            
#> [3] "            title = {{你好}},"   "            language = {zh-CN},"
#> [5] "            author = {{你好}},"  "            month = jun,"       
#> [7] "            year = {2013},"      "            pages = {163}"      
#> [9] "        }"                       "    "                           

read.bib could not parse the Chinese characters, regardless of whether encoding was set to "UTF-8" or left at its default.

str(bibtex::read.bib("test.bib"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{浣犲ソ}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "浣犲ソ"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 

str(bibtex::read.bib("test.bib", encoding = "UTF-8"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{浣犲ソ}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "浣犲ソ"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 

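The garbled title and author above are consistent with the UTF-8 bytes of the file being decoded as code page 936 (GBK). A minimal base R sketch of that assumption follows; it does not call bibtex at all, the variable names are illustrative, and it assumes the local iconv build recognises the "GBK" charset name:

```r
# "你好" written with \u escapes so the example does not depend on the
# encoding of this script.
x <- "\u4f60\u597d"

# Raw UTF-8 bytes, as they are stored in test.bib.
utf8_bytes <- charToRaw(enc2utf8(x))
utf8_bytes
#> [1] e4 bd a0 e5 a5 bd

# Decode those same bytes as GBK (code page 936) instead of UTF-8.
garbled <- iconv(list(utf8_bytes), from = "GBK", to = "UTF-8")
garbled
#> [1] "浣犲ソ"

# Converting back recovers the original bytes: nothing was lost, the bytes
# were only interpreted with the wrong encoding.
restored <- iconv(garbled, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
identical(restored, utf8_bytes)
#> [1] TRUE
```

If this is what happens inside read.bib, the bytes themselves are intact and only their encoding mark needs correcting.
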
Here is my session info:

sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 

After digging a little, I found that encoding the input of make.bib.entry as "UTF-8" solves this problem, but I am not sure whether this is a proper solution.

devtools::install_github("hongyuanjia/bibtex")
str(bibtex::read.bib("test.bib"))
#> List of 1
#>  $ text:Class 'bibentry'  hidden list of 1
#>   ..$ text:List of 6
#>   .. ..$ title   : chr "{你好}"
#>   .. ..$ language: chr "zh-CN"
#>   .. ..$ author  :Class 'person'  hidden list of 1
#>   .. .. ..$ :List of 5
#>   .. .. .. ..$ given  : NULL
#>   .. .. .. ..$ family : chr "你好"
#>   .. .. .. ..$ role   : NULL
#>   .. .. .. ..$ email  : NULL
#>   .. .. .. ..$ comment: NULL
#>   .. ..$ month   : chr "jun"
#>   .. ..$ year    : chr "2013"
#>   .. ..$ pages   : chr "163"
#>   .. ..- attr(*, "bibtype")= chr "Misc"
#>   .. ..- attr(*, "key")= chr "text"
#>  - attr(*, "class")= chr "bibentry"
#>  - attr(*, "strings")= Named chr(0) 
#>   ..- attr(*, "names")= chr(0) 

mrustl commented 5 years ago

I have a similar problem with my bib file (kwb_dummy.txt) on Windows:

### Importing file with default encoding
bibtex::read.bib(file = "kwb_dummy.txt")

Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Setting encoding to UTF-8 does not change result
bibtex::read.bib(file = "kwb_dummy.txt", encoding = "UTF-8")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Correct import with readLines
readLines("kwb_dummy.txt", n = 3, encoding = "UTF-8")
[1] "@article{RN7335,"                                                                                                     
[2] "   author = {Grützmacher, Gesche and Kumar, P.J.Sajil and Rustler, Michael and Hannappel, Stephan and Sauer, U.},"    
[3] "   title = {Geogenic groundwater contamination – definition, occurrence and relevance for drinking water production},"

### System
sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 7 x64 SP 1          
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/Berlin               
 date     2018-12-11                  

- Packages --------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
 bibtex        0.4.2   2017-06-30 [1] CRAN (R 3.5.1)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.1)
 htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
 httr          1.3.1   2017-08-20 [1] CRAN (R 3.5.0)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)
 knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
 lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)
 packrat       0.4.9-3 2018-06-01 [1] CRAN (R 3.5.1)
 plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
 RefManageR    1.2.0   2018-04-25 [1] CRAN (R 3.5.1)
 rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.1)
 rstudioapi    0.8     2018-10-02 [1] CRAN (R 3.5.1)
 sessioninfo   1.1.0   2018-09-25 [1] CRAN (R 3.5.1)
 stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
 stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
 xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.1)

[1] C:/Users/mrustl.KWB/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library

GegznaV commented 4 years ago

I can still confirm that there is an encoding issue in bibtex::do_read_bib() and bibtex::read.bib() on Windows:

file <- "book.bib"
encoding <- "UTF-8"
out <- bibtex::do_read_bib(file, encoding = encoding, srcfile(file, encoding = encoding))
out[[1]]

##                                                      address 
##                                                      "Vilnius" 
##                                                         author 
##   "{\\v{C}}ekanavi{\\v{c}}ius, Vydas and Murauskas, Gediminas" 
##                                                          title 
##      "{Taikomoji regresinÄ— analizÄ— socialiniuose tyrimuose}" 

The contents of "book.bib" file:

@book{Cekanavicius2014,
    address = {Vilnius},
    author = {{\v{C}}ekanavi{\v{c}}ius, Vydas and Murauskas, Gediminas},
    title = {{Taikomoji regresinė analizė socialiniuose tyrimuose}},
    year = {2014}
}

An RStudio project for further experimentation: bib-file--UTF-8--issue.zip

@romainfrancois It is quite an old issue. What can be done towards solving it? The solution to this issue would also solve some issues in packages that depend on bibtex including ropensci/RefManageR#66 or crsh/citr#67

hongyuanjia commented 3 years ago

Some findings on this:

bibtex::read.bib() is able to read bib files on Windows if bib files were written with native.enc encoding:

Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <-
"
@misc{text,
    title = {{你好}},
    author = {{你好}},
    year = 2020
}
"

# native encoding which is the default on Windows
options(encoding = "native.enc")
writeLines(bib_text, "native.enc.bib")

readLines("native.enc.bib")
# [1] ""                       "@misc{text,"
# [3] "    title = {{你好}},"  "    author = {{你好}},"
# [5] "    year = 2020"        "}"
# [7] ""

# default encoding option "unknown" which is equivalent to "native.enc"
bibtex::read.bib("native.enc.bib", encoding = "unknown") 
# 你好 (2020). "你好."

bibtex::read.bib() is not able to read bib files on Windows if bib files were written with UTF-8 encoding:

# UTF-8 encoding
# NOTE:
# The 'native.enc' encoding option is still necessary on Windows to ensure
# that the file is written as UTF-8. useBytes should also be set to TRUE to
# prevent writeLines() from re-encoding the text in the file() connection.
# See https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
# and https://github.com/yihui/xfun/blob/12e77f58cbee106bfdfb0b288282f47cbf537937/R/io.R#L32
options(encoding = 'native.enc')
writeLines(enc2utf8(bib_text), "utf8.bib", useBytes = TRUE)

readLines("utf8.bib", encoding = "UTF-8")
# [1] ""                           "    @misc{text,"
# [3] "        title = {{你好}},"  "        author = {{你好}},"
# [5] "        year = 2020"        "    }"
# [7] ""

bibtex::read.bib("utf8.bib", encoding = "UTF-8")
# 浣犲ソ (2020). "浣犲ソ
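
One way to confirm that a file written like this really contains UTF-8 bytes, independent of the current locale, is to compare its raw content with the bytes of the string. A minimal sketch: the temporary file and variable names are illustrative, and the connection is opened in binary mode here (unlike the call above) so that the line ending is a plain "\n" on every platform:

```r
# A one-line BibTeX entry containing "你好", written with \u escapes.
txt <- enc2utf8("@misc{text, title = {{\u4f60\u597d}}, year = 2020}")

# Write the string byte-for-byte; useBytes = TRUE prevents re-encoding.
path <- tempfile(fileext = ".bib")
con <- file(path, open = "wb")
writeLines(txt, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes and compare with the expected UTF-8 bytes
# followed by the newline that writeLines() appends.
bytes <- readBin(path, what = "raw", n = file.size(path))
expected <- c(charToRaw(txt), charToRaw("\n"))
identical(bytes, expected)
#> [1] TRUE
```
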

The issue here is that even when UTF-8 is selected as the encoding, bibtex::do_read_bib() still returns the parsed text marked as native encoded:

out_native.enc <- .External( "do_read_bib", file = "native.enc.bib", encoding = "unknown", srcfile = srcfile("native.enc.bib", "native.enc") )
out_native.enc
# [[1]]
#    title   author     year 
# "{你好}" "{你好}"   "2020" 
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
# 
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# native encoded which is expected
lapply(out_native.enc, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

out_utf8 <- .External( "do_read_bib", file = "utf8.bib", encoding = "UTF-8", srcfile = srcfile("utf8.bib", "UTF-8") )
out_utf8
# [[1]]
#      title     author       year
# "{浣犲ソ}" "{浣犲ソ}"     "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# this is also native encoded
lapply(out_utf8, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

Forcing the encoding mark to UTF-8 fixes this issue:

# change to UTF-8
lapply(out_utf8, `Encoding<-`, "UTF-8")
# [[1]]
#    title   author     year
# "{你好}" "{你好}"   "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
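
Note that a plain lapply() returns a new list, so the "include", "strings" and "preamble" attributes printed earlier appear to be dropped from the result above. A minimal sketch of a helper that keeps them; mark_utf8() is a hypothetical name, not part of bibtex, and the input below is only a stand-in for the parser output:

```r
# Hypothetical helper, not part of bibtex: mark every field of each parsed
# entry as UTF-8 while keeping the attributes of the outer list.
mark_utf8 <- function(entries) {
  # Assigning into entries[] preserves attributes such as "strings" and
  # "preamble", which the result of a plain lapply() would not carry.
  entries[] <- lapply(entries, function(fields) {
    Encoding(fields) <- "UTF-8"
    fields
  })
  entries
}

# Stand-in for the parser output: UTF-8 bytes without an encoding mark.
fields <- c(title = "{\u4f60\u597d}", year = "2020")
Encoding(fields) <- "unknown"
attr(fields, "entry") <- "misc"
attr(fields, "key") <- "text"
out <- structure(list(fields), preamble = character(0))

fixed <- mark_utf8(out)

# ASCII-only strings such as the year never carry an encoding mark.
Encoding(fixed[[1]])
#> [1] "UTF-8"   "unknown"

# The outer attribute survives.
attr(fixed, "preamble")
#> character(0)
```
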

Since do_read_bib() is written in C, it is possible that the input stream defaults to the "C" locale and falls back to the native encoding on Windows. Unfortunately I know little about C, so this is just a guess. It is supported by the fact that changing the encoding option for do_read_bib() yields the same parsed text and the same encoding mark:

Encoding(.External( "do_read_bib", file = "native.enc.bib", encoding = "latin1", srcfile = srcfile("native.enc.bib", "native.enc"))[[1]])
# [1] "unknown" "unknown" "unknown"

So in summary, on Windows it is better to always use native.enc. Downstream packages that use bibtex::do_read_bib(), such as RefManageR::ReadBib(), should set the default encoding to "unknown" instead of "UTF-8".
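
For a file that is already UTF-8 encoded, one possible workaround along these lines is to copy it to a temporary file in the native encoding before parsing. This is only a sketch under that assumption: utf8_bib_to_native() is a hypothetical helper, not part of bibtex, and it stops when the native code page cannot represent some characters.

```r
# Hypothetical helper, not part of bibtex: copy a UTF-8 encoded .bib file to
# a temporary file in the native encoding.
utf8_bib_to_native <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # to = "" means the encoding of the current locale; characters that cannot
  # be converted make the whole element NA.
  native <- iconv(lines, from = "UTF-8", to = "")
  if (anyNA(native)) {
    stop("some lines cannot be represented in the native encoding")
  }
  out <- tempfile(fileext = ".bib")
  # Write the already converted bytes without re-encoding them.
  writeLines(native, out, useBytes = TRUE)
  out
}

# ASCII-only demonstration so that the result does not depend on the locale.
src <- tempfile(fileext = ".bib")
writeLines(c("@misc{text,", "    year = 2020", "}"), src)
dst <- utf8_bib_to_native(src)
readLines(dst)
```

The returned path could then be passed to bibtex::read.bib() with the default encoding = "unknown".
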

I will send a PR to provide a possible fix on the R side.