Open soetang opened 3 weeks ago
One encoding-specific set of changes that made it into 1.5.0 was #531, and they look like they're likely the culprit. @shrektan, any thoughts on which connection parameters should folks now set to see the correct behavior? Or is this unexpected?
Hi there.
While waiting for @shrektan to chime in. I am unable to replicate the failure with your second example on my setup. Can you think of how your and my setups are different?
> df <- data.frame(
var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
bøvs = 1
)
> df
var_char_col bøvs
1 kanin 1
2 ræven 1
3 ålens 1
4 ørred 1
> DBI::dbWriteTable(conn, name = "#deleteme", df, field.types = c('var_char_col' = 'varchar(5)'), overwrite = TRUE)
> DBI::dbReadTable(conn, name = "#deleteme")
var_char_col bøvs
1 kanin 1
2 ræven 1
3 ålens 1
4 ørred 1
> conn@encoding
[1] ""
> conn@info
...
$dbms.name
[1] "Microsoft SQL Server"
$db.version
[1] "15.00.4249"
...
$drivername
[1] "libmsodbcsql-17.10.so.6.1"
...
$driver.version
[1] "17.10.0006"
> sessionInfo()$otherPkgs$odbc$Version
[1] "1.5.0.9000"
> sessionInfo()$R.version[c("platform", "arch", "version.string")]
$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$version.string
[1] "R version 4.2.1 (2022-06-23)"
Completely shooting off the hip - are you working against an older version of SQL Server? I think the ability to store UTF-8 encoded characters in VARCHAR
columns is only supported starting with SQL Server 2019. If your version is older, and you are not specifying an encoding
to the connection
argument ( as in your second example ), I suspect you would have success declaring the column NVARCHAR
. Only guessing at this point, however.
Need to think how this changed between 1.4.2 and 1.5.0, and how/why your first example worked then and no longer works today.
Nevermind @soetang:
Ignore the note above; after creating a catalog with your collation, I am able to replicate your second example as well.
Thanks for looking into it. Tested some more and teoretically i can get it to work by converting the data myself to and from latin1 and by not providing the encoding argument. This however is what I would expect happened internally when i provide the encoding.
library('magrittr')
conn <- DBI::dbConnect(
odbc::odbc()
, Driver = "ODBC Driver 17 for SQL Server"
, Server = "server"
, Database = 'database'
, Trusted_Connection = "Yes"
#, encoding = 'latin1'
, AutoTranslate = 'no'
)
test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
bøvs = 1
)
df$var_char_col <- stringi::stri_encode(df$var_char_col, to = 'LATIN1')
DBI::dbWriteTable(
, conn =conn
, name = test_table
, value = df
, field.types = c('var_char_col' = 'varchar(5)')
, overwrite = TRUE
)
db_data <- DBI::dbReadTable(conn, test_table)
db_data
#> var_char_col bøvs
#> 1 kanin 1
#> 2 r\xe6ven 1
#> 3 \xe5lens 1
#> 4 \xf8rred 1
db_data$var_char_col <- stringi::stri_encode(
db_data$var_char_col
, from = 'latin1'
, to = 'UTF-8'
)
db_data
#> var_char_col bøvs
#> 1 kanin 1
#> 2 ræven 1
#> 3 ålens 1
#> 4 ørred 1
We are using SQL server 2019:
conn@info
...
#> $dbms.name
#> [1] "Microsoft SQL Server"
#>
#> $db.version
#> [1] "15.00.4375"
....
#> $drivername
#> [1] "libmsodbcsql-17.10.so.6.1"
#>
...
#> $driver.version
#> [1] "17.10.0006"
#>
....
Thanks for the investigation.
Here's some more notes from me:
AutoTranslate
parameter.
no
, VAR/CHAR
columns are returned across the odbc
membrane encoded in a CodePage
appropriate to the collation that is used - in your case Windows-1252
( or latin1
). However the column names are coming back as UTF-8
.yes
, data and column names are coming back as UTF-8
.encoding
and the AutoTranslate
parameter to your connection specification and hopefully everything will work as expected. When writing data, you should keep the encoding
parameter; I think when writing data the AutoTranslate
parameter is inconsequential.encoding
parameter is specified, it is used to convert both column names and column contents ( previously, only column contents ). I also tried to replicate some of the examples from #430 but was unable to - however I think I am missing some important pieces from @shrektan 's environment.So I tried writing data while only including the encoding parameter (Without AutoTranslate
). This did not end up in a correct result:
The same time I just reran my second example but with varchar(6). And while it from R makes both writing and retrieving look correct - the actual data in the database is incorrect:
R result:
library('magrittr')
conn <- DBI::dbConnect(
odbc::odbc()
, Driver = "ODBC Driver 17 for SQL Server"
, Server = "server"
, Database = 'database'
, Trusted_Connection = "Yes"
# , encoding = 'latin1'
, AutoTranslate = 'no'
)
test_table <- DBI::Id(schema = 'test', table = 'test_table')
df <- tibble::tibble(
var_char_col = c('kanin', 'ræven', 'ålens', 'ørred'),
bøvs = 1
)
DBI::dbWriteTable(
conn
, name = test_table
, df
, field.types = c('var_char_col' = 'varchar(6)')
, overwrite = TRUE
)
db_data <- DBI::dbReadTable(conn, test_table)
db_data
#> var_char_col bøvs
#> 1 kanin 1
#> 2 ræven 1
#> 3 ålens 1
#> 4 ørred 1
However I can confirm that for reading the data - it works fine without the encoding
and autotranslate
arguments.
Thanks @soetang
Do you have the ability to test development branches? If so can you see if with https://github.com/r-dbi/odbc/pull/845 we have restored the prior behavior ( everything works with the encoding
and AutoTranslate
arguments as in your original post ).
I am off the next days - but have asked my team if they can test it. Thx. Alot.
@detule I can confirm that it works with your branch.
We have been expriencing encoding problems with odbc 1.5.0 on SQL server. This works fine on odbc 1.4.2
All our databases unfortunately uses the varchar collation "Danish_Norwegian_CI_AS" - with odbc 1.4.2 however we were able to create a connection so that we could correctly read and write to the database. With odbc 1.5.0 the column names are no longer formatted correctly.
Is this a bug or can we change the connection settings so that it works correctly?
Expected value:
If we remove the encoding parameter - we get correct colum names, however the character vector no longer fit within the varchar(5) datatype even though we only have 5 characters: