r-dbi / RMariaDB

An R interface to MariaDB
https://rmariadb.r-dbi.org
Other
129 stars 40 forks source link

Encoding issues with emojis on MySQL database when using load_data_local_infile #332

Closed ecoffingould closed 5 months ago

ecoffingould commented 5 months ago

I am writing a data set with emojis in it to a MySQL database. When the load_data_local_infile flag is set to TRUE (which is necessary for speed purposes), the write fails due to encoding issues. When the load_data_local_infile flag is not called or set to FALSE, the write succeeds, but goes much slower.

I have triple-checked the encoding in all other places — the MySQL database, the connection, the R file itself. All are encoded to utf-8, and can be written normally if the load_data_local_infile flag isn't used. This is also an issue that I didn't have on the old RMySQL package.

I also tried specifying the charset or the encoding in the dbConnect() stage, but that didn't seem to do anything. This is driving me crazy, please help.

df_with_emojis
  db_con _1 <- dbConnect(
    MariaDB(),
    host = Sys.getenv('mysql_url'),
    port = 3306 ,
    user = Sys.getenv('mysql_username') ,
    password = Sys.getenv('mysql_pw') ,
    db = 'my_db',
    load_data_local_infile = TRUE,
    bigint = "integer"
  )

  db_con _2 <- dbConnect(
    MariaDB(),
    host = Sys.getenv('mysql_url'),
    port = 3306 ,
    user = Sys.getenv('mysql_username') ,
    password = Sys.getenv('mysql_pw') ,
    db = 'my_db',
    bigint = "integer"
  )

dbWriteTable(db_con_1,table_name_here,df_with_emojis, overwrite = TRUE,row.names = FALSE)
# This fails with the encoding error "Error executing query: Invalid utf8mb3 character string: xyz problematic emoji line here"

dbWriteTable(db_con_2,table_name_here,df_with_emojis, overwrite = TRUE,row.names = FALSE)
# This works, but slower
krlmlr commented 5 months ago

Thanks, Eric. Is the following relevant?

https://stackoverflow.com/a/75810281/946850

ecoffingould commented 5 months ago

Thank you for the quick response!

My tables are in utf8mb4 — I checked the create statement for table_name_here. The default character set for my database + schema are both utf8mb4. My MySQL server version is 8.0.35

That makes sense with why this write works when load_data_local_infile is set to FALSE. And it's also why the error message is so weird — cause i've double- and triple-checked and the encodings are utf8mb4.

I think that whatever is happening in the load_data_local_infile step is somehow encoding the data as utf8mb3?

krlmlr commented 5 months ago

Thanks. With https://stackoverflow.com/a/75421637/946850, it becomes apparent that

https://github.com/r-dbi/RMariaDB/blob/24e25206f7739d4493dea2dda4cef6c8abed35d6/R/table.R#L61-L68

is wrong and should be utf8mb4 instead. Would you like to contribute a PR?

ecoffingould commented 5 months ago

Here you are!

krlmlr commented 5 months ago

Thanks for the PR!