Open hongyuanjia opened 6 years ago
I have a similar problem with my bib file (kwb_dummy.txt) on Windows:
### Importing file with default
bibtex::read.bib(file = "kwb_dummy.txt")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.
### Setting encoding to UTF-8 does not change result
bibtex::read.bib(file = "kwb_dummy.txt", encoding = "UTF-8")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.
> bibtex::read.bib(file = "kwb_dummy.txt")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.
### Correct import with readLines
readLines("kwb_dummy.txt", n = 3, encoding = "UTF-8")
[1] "@article{RN7335,"
[2] " author = {Grützmacher, Gesche and Kumar, P.J.Sajil and Rustler, Michael and Hannappel, Stephan and Sauer, U.},"
[3] " title = {Geogenic groundwater contamination – definition, occurrence and relevance for drinking water production},"
### System
sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 7 x64 SP 1
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United Kingdom.1252
ctype English_United Kingdom.1252
tz Europe/Berlin
date 2018-12-11
- Packages --------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 [1] CRAN (R 3.5.0)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.1)
knitr 1.20 2018-02-20 [1] CRAN (R 3.5.0)
lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
packrat 0.4.9-3 2018-06-01 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0)
RefManageR 1.2.0 2018-04-25 [1] CRAN (R 3.5.1)
rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.1)
rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.5.1)
sessioninfo 1.1.0 2018-09-25 [1] CRAN (R 3.5.1)
stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.1)
[1] C:/Users/mrustl.KWB/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library
I can still confirm that there is an encoding issue in bibtex::do_read_bib() and bibtex::read.bib() on Windows:
file <- "book.bib"
encoding <- "UTF-8"
out <- bibtex::do_read_bib(file, encoding = encoding, srcfile(file, encoding = encoding))
out[[1]]
## address
## "Vilnius"
## author
## "{\\v{C}}ekanavi{\\v{c}}ius, Vydas and Murauskas, Gediminas"
## title
## "{Taikomoji regresinÄ— analizÄ— socialiniuose tyrimuose}"
The contents of the "book.bib" file:
@book{Cekanavicius2014,
address = {Vilnius},
author = {{\v{C}}ekanavi{\v{c}}ius, Vydas and Murauskas, Gediminas},
title = {{Taikomoji regresinė analizė socialiniuose tyrimuose}},
year = {2014}
}
An RStudio project for further experimentation: bib-file--UTF-8--issue.zip
@romainfrancois This is quite an old issue. What can be done towards solving it? Solving it would also resolve issues in packages that depend on bibtex, including ropensci/RefManageR#66 and crsh/citr#67.
Some findings on this:
bibtex::read.bib() is able to read bib files on Windows if bib files were written with native.enc encoding:
Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
bib_text <-
"
@misc{text,
title = {{你好}},
author = {{你好}},
year = 2020
}
"
# native encoding which is the default on Windows
options(encoding = "native.enc")
writeLines(bib_text, "native.enc.bib")
readLines("native.enc.bib")
# [1] "" "@misc{text,"
# [3] " title = {{你好}}," " author = {{你好}},"
# [5] " year = 2020" "}"
# [7] ""
# default encoding option "unknown" which is equivalent to "native.enc"
bibtex::read.bib("native.enc.bib", encoding = "unknown")
# 你好 (2020). "你好."
bibtex::read.bib() is not able to read bib files on Windows if bib files were written with UTF-8 encoding:
# UTF-8 encoding
# NOTE:
# The 'native.enc' encoding option is still necessary on Windows to ensure
# writing as UTF-8. useBytes should also be set to TRUE to prevent writeLines()
# from re-encoding the text in the file() connection.
# See https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
# and https://github.com/yihui/xfun/blob/12e77f58cbee106bfdfb0b288282f47cbf537937/R/io.R#L32
options(encoding = 'native.enc')
writeLines(enc2utf8(bib_text), "utf8.bib", useBytes = TRUE)
readLines("utf8.bib", encoding = "UTF-8")
# [1] "" " @misc{text,"
# [3] " title = {{你好}}," " author = {{你好}},"
# [5] " year = 2020" " }"
# [7] ""
bibtex::read.bib("utf8.bib", encoding = "UTF-8")
# 浣犲ソ (2020). "浣犲ソ
The issue here is that even when UTF-8 is selected as the encoding, bibtex::do_read_bib() still returns the parsed text as native-encoded:
out_native.enc <- .External(
  "do_read_bib",
  file = "native.enc.bib",
  encoding = "unknown",
  srcfile = srcfile("native.enc.bib", "native.enc")
)
out_native.enc
# [[1]]
# title author year
# "{你好}" "{你好}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)
# native encoded which is expected
lapply(out_native.enc, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#
out_utf8 <- .External(
  "do_read_bib",
  file = "utf8.bib",
  encoding = "UTF-8",
  srcfile = srcfile("utf8.bib", "UTF-8")
)
out_utf8
# [[1]]
# title author year
# "{浣犲ソ}" "{浣犲ソ}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)
# this is also native encoded
lapply(out_utf8, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#
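The garbled output above is consistent with UTF-8 bytes being interpreted in the native code page (936 in this locale). The following is a minimal base-R sketch of that effect, not taken from the bibtex sources. It assumes the local iconv build supports "GBK" (otherwise iconv() returns NA), and the variable names are made up for illustration.

```r
# "你好" written as explicit UTF-8 bytes, so the example does not depend on
# the encoding of this script or on the current locale.
utf8_bytes <- as.raw(c(0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd))
bytes_as_string <- rawToChar(utf8_bytes)

# Misread the UTF-8 bytes as GBK (code page 936) and convert them to UTF-8.
# Assumption: this iconv build knows the "GBK" encoding name.
garbled <- iconv(bytes_as_string, from = "GBK", to = "UTF-8")

# Undo the misreading: encoding the garbled text back to GBK should give
# the original UTF-8 bytes again.
restored <- iconv(garbled, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

print(garbled)
print(identical(restored, utf8_bytes))
```

If the round trip succeeds, the bytes were never damaged, only labelled with the wrong encoding, which is why re-declaring the encoding is enough to repair the text.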
Forcing the encoding to UTF-8 fixes this issue.
# change to UTF-8
lapply(out_utf8, `Encoding<-`, "UTF-8")
# [[1]]
# title author year
# "{你好}" "{你好}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
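A possible way to package this re-marking step is sketched below. `mark_utf8()` is a hypothetical helper, not part of bibtex, and the `parsed` object is a hand-built stand-in for parser output so that the example runs without the package. Unlike a bare lapply(), the helper also copies the attributes of the outer list back.

```r
# Hypothetical helper: declare every field of every parsed entry to be UTF-8
# while keeping the attributes stored on the outer list.
mark_utf8 <- function(entries) {
  marked <- lapply(entries, function(fields) {
    Encoding(fields) <- "UTF-8"  # changes the declared encoding, not the bytes
    fields
  })
  attributes(marked) <- attributes(entries)
  marked
}

# Stand-in for parser output: "{你好}" as UTF-8 bytes with no declared encoding.
raw_title <- rawToChar(as.raw(c(0x7b, 0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd, 0x7d)))
entry <- c(title = raw_title, year = "2020")
attr(entry, "entry") <- "misc"
attr(entry, "key") <- "text"
parsed <- structure(list(entry), preamble = character(0))

fixed <- mark_utf8(parsed)
# ASCII-only fields such as the year cannot be marked and stay "unknown".
print(Encoding(fixed[[1]]))
```

This only helps when the file really is UTF-8. Applied to a file in the native encoding, it would mislabel the bytes instead.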
Since do_read_bib() is written in C, it is possible that the input stream defaults to the "C" locale and falls back to the native encoding on Windows. Unfortunately I know little about C, so this is just my guess. One observation that supports it: changing the encoding argument of do_read_bib() produces the same parsed text and the same encoding:
Encoding(.External( "do_read_bib", file = "native.enc.bib", encoding = "latin1", srcfile = srcfile("native.enc.bib", "native.enc"))[[1]])
# [1] "unknown" "unknown" "unknown"
So in summary, on Windows it is better to always use native.enc. For downstream packages that use bibtex::do_read_bib(), such as RefManageR::ReadBib(), the default encoding should be set to unknown instead of UTF-8.
I will send a PR to provide a possible fix on the R side.
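In the meantime, one way to follow the "always use native.enc" advice for a file that is already UTF-8 might be to re-encode a copy first. This is only a sketch under assumptions: `utf8_bib_to_native()` is a hypothetical helper, it assumes options(encoding) is left at its default "native.enc", and any character that the native code page cannot represent is replaced by an escape such as <U+XXXX>.

```r
# Hypothetical helper: copy a UTF-8 .bib file into the native encoding.
utf8_bib_to_native <- function(path, out = tempfile(fileext = ".bib")) {
  # readLines() only marks the strings as UTF-8; it does not re-encode them.
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Convert to the native encoding; unrepresentable characters are escaped.
  native <- enc2native(lines)
  # Without useBytes = TRUE, writeLines() writes in the native encoding.
  writeLines(native, out)
  out
}

# Usage: build a small UTF-8 file, then convert it.
src <- tempfile(fileext = ".bib")
bib_lines <- c("@misc{text,", "  title = {{Example}},", "  year = 2020", "}")
writeLines(enc2utf8(bib_lines), src, useBytes = TRUE)
native_copy <- utf8_bib_to_native(src)

# The copy could then be read with the native default (not run here):
# bibtex::read.bib(native_copy, encoding = "unknown")
```

The trade-off is that this is lossy for text outside the native code page, so it suits files whose characters all exist in the current locale.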
Thanks for this great package. I encountered a problem when using the bibtex package to parse BibTeX files with Chinese characters on Windows: read.bib could not parse Chinese characters, whether or not encoding was set to "UTF-8".
Here is my session info
After digging a little, I found that encoding the input of make.bib.entry as "UTF-8" can solve this problem, but I am not sure whether this is a proper solution.