r-dbi / odbc

Connect to ODBC databases (using the DBI interface)
https://odbc.r-dbi.org/

MS Access encoding error: result_fetch() failed with Incomplete multibyte sequence #273

Open madlogos opened 5 years ago

madlogos commented 5 years ago

Issue Description and Expected Result

dbGetQuery() fails with "Error in result_fetch(res@ptr, n, ...) : Incomplete multibyte sequence" when reading from an MS Access database that uses GBK/CP936 (Chinese) encoding.

The data looks normal in Access, and I can read it into R successfully with RODBC::sqlQuery(). But when I run the code below, it fails with the error "Error in result_fetch(res@ptr, n, ...) : Incomplete multibyte sequence".

Database

MS Access 2016 (64-bit)

$dbms.name
[1] "ACCESS"

$db.version
[1] "12.00.0000"

$username
[1] "admin"

$host
[1] ""

$port
[1] ""

$sourcename
[1] ""

$servername
[1] "ACCESS"

$drivername
[1] "ACEODBC.DLL"

$odbc.version
[1] "03.80.0000"

$driver.version
[1] "Microsoft Access database engine"

$odbcdriver.version
[1] "03.51"

$supports.transactions
[1] FALSE

Reproducible Example

Testcase: test.zip

The following code produces the incomplete multibyte sequence error.

library(dplyr)
con <- DBI::dbConnect(
    odbc::odbc(), 
    .connection_string=paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};
                               Dbq=<testcase .accdb file>"), 
    encoding="CP936")
tbl(con, "Table1") %>% select(Place) %>% collect 

If you change encoding="CP936" to "UTF-8",

con <- DBI::dbConnect(
    odbc::odbc(), 
    .connection_string=paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};
                               Dbq=<a .accdb file>"), 
    encoding="UTF-8")
tbl(con, "Table1") %>% select(Place) %>% collect 

you will get the following

# A tibble: 3 x 1
  Place                                                      
  <chr>                                                      
1 "\xd5\xe3\xbd\xad-\xba\xbc\xd6\xdd-\xce\xf7\xcf\xaa"       
2 "\xb1\xb1\xbe\xa9-\xb9\xfa\xbc\xd2\xb9\xe3\xb8\xe6\xb2\xfa"
3 "\xd5\xe3\xbd\xad-\xba\xbc\xd6\xdd-\xce\xf7\xcf\xaa\xfa" 

Rows 1 and 3 should be identical, but row 3 comes back with 15 bytes instead of 14: it ends with a stray \xfa, which is also the last byte of row 2. I guess that dangling byte is what triggers the error.
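
To make the byte counts concrete, here is a minimal sketch in base R that rebuilds rows 1 and 3 from the escaped bytes printed above and checks whether each one ends in the middle of a two-byte character. The helper `ends_mid_character()` is hypothetical (it is not part of odbc or DBI), and it assumes the simplified rule that any byte from 0x81 upward starts a two-byte GBK/CP936 character.

```r
# Bytes copied from the tibble output above (rows 1 and 3 of Place).
row1 <- as.raw(c(0xd5, 0xe3, 0xbd, 0xad, 0x2d,
                 0xba, 0xbc, 0xd6, 0xdd, 0x2d,
                 0xce, 0xf7, 0xcf, 0xaa))
row3 <- c(row1, as.raw(0xfa))  # same bytes plus the trailing \xfa

# Hypothetical helper: TRUE when a GBK/CP936 byte sequence stops after a
# lead byte, i.e. in the middle of a two-byte character.
# Simplifying assumption: bytes >= 0x81 are lead bytes, everything else
# is a single-byte character.
ends_mid_character <- function(bytes) {
  i <- 1L
  n <- length(bytes)
  while (i <= n) {
    if (as.integer(bytes[i]) >= 0x81L) {
      if (i == n) return(TRUE)  # lead byte with no trail byte after it
      i <- i + 2L               # skip lead byte and trail byte
    } else {
      i <- i + 1L               # single-byte character such as "-"
    }
  }
  FALSE
}

length(row1)              # 14
length(row3)              # 15
ends_mid_character(row1)  # FALSE
ends_mid_character(row3)  # TRUE
```

Under these assumptions only row 3 ends on a lead byte, which is consistent with a decoder reporting an incomplete multibyte sequence for that row.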

Additions:

Session Info
```r
devtools::session_info()
- Session info ---------------------------------------------------------------
 setting  value
 version  R version 3.5.3 (2019-03-11)
 os       Windows >= 8 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    Chinese (Simplified)_China.936
 tz       Asia/Taipei
 date     2019-04-17

- Packages -------------------------------------------------------------------
 ! package     * version date       lib source
   askpass     * 1.1     2019-01-13 [1] CRAN (R 3.5.2)
   assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.3)
   backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.3)
   base64enc     0.1-3   2015-07-28 [1] CRAN (R 3.5.0)
   bit           1.1-14  2018-05-29 [1] CRAN (R 3.5.0)
   bit64         0.9-7   2017-05-08 [1] CRAN (R 3.5.0)
   blob          1.1.1   2018-03-25 [1] CRAN (R 3.4.4)
   broom       * 0.5.2   2019-04-07 [1] CRAN (R 3.5.3)
   callr         3.2.0   2019-03-15 [1] CRAN (R 3.5.3)
   cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.4.0)
   class         7.3-15  2019-01-01 [1] CRAN (R 3.5.3)
   classInt      0.3-1   2018-12-18 [1] CRAN (R 3.5.3)
   cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
   colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.5.2)
   crayon        1.3.4   2017-09-16 [1] CRAN (R 3.4.1)
   data.table  * 1.12.2  2019-04-07 [1] CRAN (R 3.5.3)
   DBI         * 1.0.0   2018-05-02 [1] CRAN (R 3.5.0)
   dbplyr        1.3.0   2019-01-09 [1] CRAN (R 3.5.2)
   desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
   devtools      2.0.2   2019-04-08 [1] CRAN (R 3.5.3)
   digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
   dplyr       * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
   e1071         1.7-1   2019-03-19 [1] CRAN (R 3.5.3)
   extrafont   * 0.17    2014-12-08 [1] CRAN (R 3.4.0)
   extrafontdb   1.0     2012-06-11 [1] CRAN (R 3.4.0)
   fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.1)
   foreign       0.8-71  2018-07-20 [1] CRAN (R 3.5.3)
   fs            1.2.7   2019-03-19 [1] CRAN (R 3.5.3)
   generics      0.0.2   2018-11-29 [1] CRAN (R 3.5.1)
   ggplot2     * 3.1.1   2019-04-07 [1] CRAN (R 3.5.3)
   ggthemes    * 4.1.1   2019-04-09 [1] CRAN (R 3.5.3)
   glue        * 1.3.1   2019-03-12 [1] CRAN (R 3.5.3)
   gtable        0.3.0   2019-03-25 [1] CRAN (R 3.5.3)
   hms           0.4.2   2018-03-10 [1] CRAN (R 3.4.3)
   htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
   lattice       0.20-38 2018-11-04 [1] CRAN (R 3.5.3)
   lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.5.3)
   magrittr      1.5     2014-11-22 [1] CRAN (R 3.4.0)
   maptools      0.9-5   2019-02-18 [1] CRAN (R 3.5.2)
   memoise       1.1.0   2017-04-21 [1] CRAN (R 3.4.0)
   munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.0)
   nlme          3.1-139 2019-04-09 [1] CRAN (R 3.5.3)
   odbc        * 1.1.6   2018-06-09 [1] CRAN (R 3.5.0)
   officer     * 0.3.3   2019-03-01 [1] CRAN (R 3.5.2)
   openxlsx    * 4.1.0   2018-05-26 [1] CRAN (R 3.5.0)
   pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.1)
   pkgbuild      1.0.3   2019-03-20 [1] CRAN (R 3.5.3)
   pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
   pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)
   plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.0)
   prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.4.0)
   processx      3.3.0   2019-03-10 [1] CRAN (R 3.5.2)
   ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.2)
   purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.3)
   R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
   Rcpp          1.0.1   2019-03-17 [1] CRAN (R 3.5.3)
   readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.2)
   readxl      * 1.3.1   2019-03-13 [1] CRAN (R 3.5.3)
   remotes       2.0.4   2019-04-10 [1] CRAN (R 3.5.3)
   reshape2      1.4.3   2017-12-11 [1] CRAN (R 3.5.0)
   rgdal       * 1.4-3   2019-03-14 [1] CRAN (R 3.5.3)
   rgeos       * 0.4-2   2018-11-08 [1] CRAN (R 3.5.1)
   rlang         0.3.4   2019-04-07 [1] CRAN (R 3.5.3)
   RODBC       * 1.3-15  2017-04-13 [1] CRAN (R 3.5.0)
   rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.4.3)
   rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.5.3)
   Rttf2pt1      1.3.7   2018-06-29 [1] CRAN (R 3.5.0)
   scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.1)
   sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)
   sf          * 0.7-3   2019-02-21 [1] CRAN (R 3.5.3)
   sp          * 1.3-1   2018-06-05 [1] CRAN (R 3.5.0)
   stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.3)
   stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
   testthat      2.0.1   2018-10-13 [1] CRAN (R 3.5.1)
   tibble        2.1.1   2019-03-16 [1] CRAN (R 3.5.3)
   tidyr       * 0.8.3   2019-03-01 [1] CRAN (R 3.5.2)
   tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
   units         0.6-2   2018-12-05 [1] CRAN (R 3.5.1)
   usethis       1.5.0   2019-04-07 [1] CRAN (R 3.5.3)
   utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.0)
   uuid          0.1-2   2015-07-28 [1] CRAN (R 3.5.0)
   withr         2.1.2   2018-03-15 [1] CRAN (R 3.4.4)
   xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.0)
   yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)
   zip           2.0.1   2019-03-11 [1] CRAN (R 3.5.3)
```

madlogos commented 5 years ago

I finally found a workaround, but it exposes another issue.

  1. Do not set encoding explicitly in dbConnect
con <- DBI::dbConnect(
    odbc::odbc(), 
    .connection_string=paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};
                               Dbq=<a .accdb file>"))
  2. Transform the encoding in the results
rslt <- tbl(con, "Table1") %>% select(Place) %>% collect
rslt[] = sapply(rslt, function(col) {
    if (is.character(col)){
        stringi::stri_encode(col, from="CP936", to="UTF-8") 
    }else{
        col
    }
})

This finally gives the tibble below:

# A tibble: 3 x 1
  Place[,"Place"]
  <chr>          
1 浙江-杭州-西溪 
2 北京-国家广告产
3 浙江-杭州-西溪�

However, it should have been

# A tibble: 3 x 1
  Place[,"Place"]
  <chr>          
1 浙江-杭州-西溪园区
2 北京-国家广告产业园
3 浙江-杭州-西溪园区
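
A side note on the `Place[,"Place"]` header in both outputs: it suggests the column ended up as a one-column matrix, probably because sapply() simplifies its result to a matrix. The sketch below is one possible variation, not a fix for the truncation: it converts column by column with lapply() so each column stays a plain vector. It uses a stand-in data frame built from the bytes shown earlier instead of a real Access connection, swaps in base R's iconv() for stringi::stri_encode(), and assumes the platform's iconv knows the name "CP936". The helper name `to_utf8()` is made up for illustration.

```r
# Stand-in for the collected result: a character column holding raw
# CP936 bytes (row 1 from the output earlier in this issue).
gbk_bytes <- as.raw(c(0xd5, 0xe3, 0xbd, 0xad, 0x2d,
                      0xba, 0xbc, 0xd6, 0xdd, 0x2d,
                      0xce, 0xf7, 0xcf, 0xaa))
rslt <- data.frame(
  id    = 1:2,
  Place = rep(rawToChar(gbk_bytes), 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: re-encode character columns, leave others alone.
to_utf8 <- function(col, from = "CP936") {
  if (is.character(col)) iconv(col, from = from, to = "UTF-8") else col
}

# lapply() returns a list of columns, so no matrix is created.
rslt[] <- lapply(rslt, to_utf8)

str(rslt)
```

With lapply() the assignment keeps `Place` as an ordinary character column and `id` as an integer column.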

Each cell is truncated to at most 15 bytes, since the field size is set to 15. But Chinese text in GBK/CP936 is multibyte: each Chinese character takes 2 bytes, so 15 bytes cannot hold 15 characters. That would also explain why the function behaved so oddly at the very beginning.
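
The gap between character count and byte count can be checked without a database. The sketch below is an illustration only: it encodes the two expected strings to CP936 with base R's iconv() and compares the two counts. The strings are written with \u escapes so the example does not depend on the source file's encoding, and it assumes the platform's iconv supports the name "CP936".

```r
# The two distinct expected values, written as Unicode escapes.
expected <- c(
  "\u6d59\u6c5f-\u676d\u5dde-\u897f\u6eaa\u56ed\u533a",      # 浙江-杭州-西溪园区
  "\u5317\u4eac-\u56fd\u5bb6\u5e7f\u544a\u4ea7\u4e1a\u56ed"  # 北京-国家广告产业园
)

# Encode to CP936 and keep the result as raw bytes.
gbk <- iconv(expected, from = "UTF-8", to = "CP936", toRaw = TRUE)

n_chars <- nchar(expected, type = "chars")
n_bytes <- lengths(gbk)
data.frame(n_chars, n_bytes)

# What a 15-byte cut-off would keep from each value.
first15 <- lapply(gbk, head, 15)
first15
```

Both strings are 10 characters long but need 18 and 19 bytes in CP936, so a buffer sized at 15 bytes cannot hold either of them in full.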

What should I do (e.g., add some argument, or do some additional work) to extract all the characters from the database? I checked odbc::result_fetch(), and it calls the _odbc_result_fetch method in nanodbc, so it seems I have no way to adjust the arguments.