sidorares / node-mysql2

:zap: fast mysqljs/mysql compatible mysql driver for node.js
https://sidorares.github.io/node-mysql2/
MIT License
4.07k stars 618 forks source link

Update collation list up to MySQL 8.0.26 #1410

Closed testn closed 2 years ago

testn commented 3 years ago

I noticed that some collations are missing so I pulled up the latest collation list from MySQL by

mysql> select concat('exports.',UPPER(COLLATION_NAME),' = ',id,';')
    ->        FROM INFORMATION_SCHEMA.COLLATIONS
    ->        ORDER BY ID;
testn commented 3 years ago

utf8_general50_ci and ucs2_general50_ci seem to be removed.

https://bugs.launchpad.net/percona-server/+bug/1163324

sidorares commented 3 years ago

while you at it - can you have a look at making utf8mb3 an alias for cesu-8? should fix #1398 #1240 #1333

sidorares commented 3 years ago

utf8_general50_ci and ucs2_general50_ci seem to be removed

would it hurt leaving them? not sure if removing might affect user connecting to server version 5.0

testn commented 3 years ago

I think utf8mb3 should be aliased to utf8 for now. Aliasing it to cesu-8 will hurt the performance significantly

sidorares commented 3 years ago

mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

testn commented 3 years ago

mysql utf8 strings are encoded as cesu-8, "real" utf8 is named utf8mb4 in mysql. utf8mb3 is same as utf8 ( that is, cesu-8 )

I think that is the future. Currently, it is still utf8mb3 (https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html)

Also this seems to clarify the differences between cesu8 and utf8mb3 https://www.wikiwand.com/en/Talk:UTF-8

sidorares commented 3 years ago

lets tackle utf8 / utf8mb3 / cesu-8 separately. Thanks for the link, interesting read!

testn commented 3 years ago

I just tried to store 😂 into utf8_unicode_ci and it said

    code: 'ER_TRUNCATED_WRONG_VALUE_FOR_FIELD',
    errno: 1366,
    sqlState: 'HY000',
    sqlMessage: "Incorrect string value: '\\xF0\\x9F\\x98\\x82' for column 'field' at row 1",
    sql: "INSERT INTO `test-charset-encoding` (field) values('😂')"
sidorares commented 3 years ago

I just tried to store 😂 into utf8_unicode_ci and it said

is this good or bad? :) ( e.i is this a positive test for 5117e4f5048beb18f4144bb26a9fffe1cbc2bbb3 ? )

testn commented 3 years ago

I just tried to store 😂 into utf8_unicode_ci and it said

is this good or bad? :) ( e.i is this a positive test for 5117e4f ? )

I think it's good. It shows that utf8mb3 cannot store non-BMP characters.

sidorares commented 3 years ago

I believe I was able to insert non-BMP 💩 using utf8 mysql encoding ( see https://github.com/sidorares/node-mysql2/pull/374#issuecomment-241748134 ), so not sure if that confirms "mysql utf8 = cesu-8", utf8mb3 = same as cecu-8 but only BMP is supported, utf8mb4 = "normal modern utf8"

testn commented 3 years ago

if you do it that way, you are converting 💩 into 6 invalid bytes (2 UTF-8 characters) in MySQL. MySQL may not be able to understand that character. It may be able to pass it in/out as if it is a two character though.

However, I believe if you try to insert 💩 using .query() method, you won't be able to to that, right?

sidorares commented 3 years ago

unit test: https://github.com/sidorares/node-mysql2/blob/694e1003732f048c91a2ab99f7a7667d88a02549/test/integration/connection/encoding/test-non-bmp-chars.js#L9-L15

testn commented 3 years ago

Can you try to insert 💩 into a column with UTF8_GENERAL_CI?

testn commented 3 years ago

@sidorares re:columnname, I think we should always use utf8 to decode the column name regardless the characterSet returned in Packet.FieldDefinition. We don't need to use cesu8 to decode it as the column should only contain BMP characters only. So whether you use cesu8 or utf8, they won't be any differences.

sidorares commented 3 years ago

can column name use other encodings ( win1251, koi8 for Cyrillic as an example )? If yes not sure if worth adding exception / custom logic for perf benefits

testn commented 2 years ago

Let me check.

ahmedbodi commented 2 years ago

any update on this?

sidorares commented 2 years ago

@ahmedbodi I think the discussion got sidetracked a bit into more complex issue, updated collation list LGTM to me. Merging now