r-dbi / odbc

Connect to ODBC databases (using the DBI interface)
https://odbc.r-dbi.org/
Other
388 stars 107 forks source link

Encoding/Collation problems with 1.5.0 SQL server #834

Open soetang opened 3 weeks ago

soetang commented 3 weeks ago

We have been expriencing encoding problems with odbc 1.5.0 on SQL server. This works fine on odbc 1.4.2

All our databases unfortunately uses the varchar collation "Danish_Norwegian_CI_AS" - with odbc 1.4.2 however we were able to create a connection so that we could correctly read and write to the database. With odbc 1.5.0 the column names are no longer formatted correctly.

Is this a bug or can we change the connection settings so that it works correctly?

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)

DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 4 × 2
#>   var_char_col bÃ.vs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

Expected value:

db_data
#> # A tibble: 4 × 2
#>   var_char_col  bøvs
#>   <chr>        <dbl>
#> 1 kanin            1
#> 2 ræven            1
#> 3 ålens            1
#> 4 ørred            1

If we remove the encoding parameter - we get correct colum names, however the character vector no longer fit within the varchar(5) datatype even though we only have 5 characters:

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  # , encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)

DBI::dbWriteTable(
  conn
  , name = test_table
  , df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)
#> Error in eval(expr, envir, enclos): nanodbc/nanodbc.cpp:1783: 00000
#> [Microsoft][ODBC Driver 17 for SQL Server]String data, right truncation

db_data <- DBI::dbReadTable(conn, test_table) %>%
  tibble::as_tibble()

db_data
#> # A tibble: 0 × 2
#> # ℹ 2 variables: var_char_col <chr>, bøvs <dbl>
simonpcouch commented 3 weeks ago

One encoding-specific set of changes that made it into 1.5.0 was #531, and they look like they're likely the culprit. @shrektan, any thoughts on which connection parameters should folks now set to see the correct behavior? Or is this unexpected?

detule commented 2 weeks ago

Hi there.

While waiting for @shrektan to chime in. I am unable to replicate the failure with your second example on my setup. Can you think of how your and my setups are different?

> df <- data.frame(
    var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
    bøvs = 1
  )
> df
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1
> DBI::dbWriteTable(conn, name = "#deleteme", df, field.types = c('var_char_col' = 'varchar(5)'), overwrite = TRUE)
> DBI::dbReadTable(conn, name = "#deleteme")
  var_char_col bøvs
1        kanin    1
2        ræven    1
3        ålens    1
4        ørred    1

 > conn@encoding
[1] ""
> conn@info
...
$dbms.name
[1] "Microsoft SQL Server"
$db.version
[1] "15.00.4249"
...
$drivername
[1] "libmsodbcsql-17.10.so.6.1"
...
$driver.version
[1] "17.10.0006"
> sessionInfo()$otherPkgs$odbc$Version
[1] "1.5.0.9000"
> sessionInfo()$R.version[c("platform", "arch", "version.string")]
$platform
[1] "x86_64-pc-linux-gnu"

$arch
[1] "x86_64"

$version.string
[1] "R version 4.2.1 (2022-06-23)"

Completely shooting off the hip - are you working against an older version of SQL Server? I think the ability to store UTF-8 encoded characters in VARCHAR columns is only supported starting with SQL Server 2019. If your version is older, and you are not specifying an encoding to the connection argument ( as in your second example ), I suspect you would have success declaring the column NVARCHAR. Only guessing at this point, however.

Need to think how this changed between 1.4.2 and 1.5.0, and how/why your first example worked then and no longer works today.

detule commented 2 weeks ago

Nevermind @soetang:

Ignore the note above; after creating a catalog with your collation, I am able to replicate your second example as well.

soetang commented 2 weeks ago

Thanks for looking into it. Tested some more and teoretically i can get it to work by converting the data myself to and from latin1 and by not providing the encoding argument. This however is what I would expect happened internally when i provide the encoding.

library('magrittr')
conn <- DBI::dbConnect(
  odbc::odbc()
  , Driver = "ODBC Driver 17 for SQL Server"
  , Server = "server"
  , Database = 'database'
  , Trusted_Connection = "Yes"
  #, encoding = 'latin1'
  , AutoTranslate = 'no'
)

test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
  var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
  bøvs = 1
)
df$var_char_col <- stringi::stri_encode(df$var_char_col, to = 'LATIN1')

DBI::dbWriteTable(
  , conn =conn
  , name = test_table
  , value = df
  , field.types = c('var_char_col' = 'varchar(5)')
  , overwrite = TRUE
)

db_data <- DBI::dbReadTable(conn, test_table) 

db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2     r\xe6ven    1
#> 3     \xe5lens    1
#> 4     \xf8rred    1

db_data$var_char_col <- stringi::stri_encode(
  db_data$var_char_col
  , from = 'latin1'
  , to = 'UTF-8'
)
db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1

We are using SQL server 2019:

conn@info
...
#> $dbms.name
#> [1] "Microsoft SQL Server"
#> 
#> $db.version
#> [1] "15.00.4375"
....
#> $drivername
#> [1] "libmsodbcsql-17.10.so.6.1"
#> 
...
#> $driver.version
#> [1] "17.10.0006"
#> 
....
detule commented 2 weeks ago

Thanks for the investigation.

Here's some more notes from me:

soetang commented 2 weeks ago

So I tried writing data while only including the encoding parameter (Without AutoTranslate). This did not end up in a correct result:

image

The same time I just reran my second example but with varchar(6). And while it from R makes both writing and retrieving look correct - the actual data in the database is incorrect:

image

R result:


  library('magrittr')
  conn <- DBI::dbConnect(
    odbc::odbc()
    , Driver = "ODBC Driver 17 for SQL Server"
    , Server = "server"
    , Database = 'database'
    , Trusted_Connection = "Yes"
    # , encoding = 'latin1'
    , AutoTranslate = 'no'
  )

  test_table <- DBI::Id(schema = 'test', table = 'test_table')
  df <- tibble::tibble(
    var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
    bøvs = 1
  )

  DBI::dbWriteTable(
    conn
    , name = test_table
    , df
    , field.types = c('var_char_col' = 'varchar(6)')
    , overwrite = TRUE
  )

  db_data <- DBI::dbReadTable(conn, test_table) 

  db_data
#>   var_char_col bøvs
#> 1        kanin    1
#> 2        ræven    1
#> 3        ålens    1
#> 4        ørred    1
soetang commented 2 weeks ago

However I can confirm that for reading the data - it works fine without the encoding and autotranslate arguments.

detule commented 1 week ago

Thanks @soetang

Do you have the ability to test development branches? If so can you see if with https://github.com/r-dbi/odbc/pull/845 we have restored the prior behavior ( everything works with the encoding and AutoTranslate arguments as in your original post ).

soetang commented 1 week ago

I am off the next days - but have asked my team if they can test it. Thx. Alot.

soetang commented 3 days ago

@detule I can confirm that it works with your branch.