vapor / mysql-kit

🐬 Pure Swift MySQL client built on non-blocking, event-driven sockets.
MIT License
223 stars 75 forks source link

Connection CharacterSet is incorrectly determined #322

Closed dezinezync closed 1 year ago

dezinezync commented 1 year ago

Describe the bug

For the following query:

SELECT * FROM performance_schema.session_variables
WHERE VARIABLE_NAME IN (
 'character_set_client', 'character_set_connection',
 'character_set_results', 'collation_connection'
) ORDER BY VARIABLE_NAME;

Sequel Pro reports the results as follows:

VARIABLE_NAME VARIABLE_VALUE
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_results utf8mb4
collation_connection utf8mb4_general_ci
However, my Vapor-Fluent setup with the mysql driver reports it as follows: VARIABLE_NAME VARIABLE_VALUE
character_set_client latin1
character_set_connection latin1
character_set_results latin1
collation_connection latin1_swedish_ci

Configuring the database in Vapor:

import Vapor
import Fluent
import FluentMySQLDriver

public func configure(_ app: Application) throws {
  // ...
  app.databases.use(
    .mysql(
      hostname: readHostname,
      port: port,
      username: username,
      password: password,
      database: DB_NAME,
      tlsConfiguration: tlsConfig,
      maxConnectionsPerEventLoop: 5
    ),
    as: .readDB
  )
}

Many years ago, there was an option to set the CharacterSet on the connection, but has since been removed. I couldn't figure out a way to configure this with the latest versions of the related packages.

Due to the incorrect inferred character set, unicode characters get substituted by ? in VARCHAR and TEXT fields.

I'd appreciate any help debugging this issue or being able to configure the connection character set and collation params.

Environment

Related

The following issues/PRs may be related:

dezinezync commented 1 year ago

Temporary workaround: Executing the following query switches the connection to use the desired charset and collation:

SET NAMES 'utf8mb4' COLLATE 'utf8mb4_general_ci';
gwynne commented 1 year ago

This is addressed by the mysql-nio rewrite I've 90% completed (and trust me, no one's more sick of hearing me say it's "almost done" for a year straight than I am 😓😆). I'll make sure this issue gets mentioned in the PR as soon as it's up.

In the meantime, you can work around it by issuing an appropriate SET NAMES query, but because of how Fluent's connection pools currently work, the only way this can work reliably is if you run it inside a .withConnection() closure wrapped around every single route handler (and/or other database usage). For the record, the query would be as simple as:

try await (db as! any MySQLDatabase).simpleQuery("SET NAMES utf8mb4", onRow: { _ in }).run()

Technical note: This is equally as much an issue caused by MySQL 5.7's epically outdated handling of Unicode as it is any fault of the (very shoddy) existing MySQLNIO implementation. For some gory details, have a look at the documentation comments I've included with the relevant logic in the MySQLNIO rewrite: https://gist.github.com/gwynne/95679fc31b6b897799684ad5b9073066. (Please note this will almost certainly not be the final code; as mentioned, the rewrite is still in progress.)

P.S.: As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x (which I would very strongly recommend if at all possible).

dezinezync commented 1 year ago

@gwynne I really appreciate the super quick response, with details (especially on a weekend 🫡)

Glad to know this is being refactored and support will be landing soon.

In the meantime, you can work around it by issuing an appropriate SET NAMES query

Yes, I have a temporary workaround for this, but it'd be just cleaner to have upstream support.

As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x

I'll test with the suggested collation param and follow up shortly.

I'm unable to update to MySQL 8.x immediately, but this is scheduled for December. Do you reckon this is a non-issue when running MySQL 8.x or would the workaround still be necessary until your refactor is complete?

gwynne commented 1 year ago

The single-byte collection identifier that the current version of MySQLNIO sends in the protocol HandshakeResponse41 packet is 255, which corresponds to utf8mb4_0900_ai_ci, which means that with MySQL 8.x you get the correct behavior every time.

Of course, that collation ID doesn't exist yet in 5.7 (or, for that matter, in even the most recent versions of MariaDB), so the server falls back on its default-configured character set instead. 5.7 has still been getting maintenance updates all this time (although it's soon to FINALLY be declared EOL and put out of its misery at long last); at some point and they changed the default for new installations (and most hosted solutions and UNIX distros in general patched that default in a much longer time ago), so even those still using it have a decent chance of lucking into a correct configuration. Unfortunately, many - such as yourself - still end up with utf8 (aka utf8mb3, the crippled half-Unicode BMP-only encoding MySQL fail-invented) or latin1 - and since by deliberate design Unicode's first 256 codepoints correspond exactly to those of ISO-Latin-1 (which in turn means the first 128 are the ASCII table), it's easy to not notice it for an extended period.

dezinezync commented 1 year ago

As per the comments in the Gist, you probably want to use the utf8mb4_unciode_520_ci collation if you can't update to MySQL 8.x (which I would very strongly recommend if at all possible)

Tested locally updating MySQL to v8.0.3-rc (latest supported by the OS) and I can confirm this works without any workarounds.

I'll keep the workaround for my production MySQL server which I'm unable to update immediately to v8.

Of course, that collation ID doesn't exist yet in 5.7 (or, for that matter, in even the most recent versions of MariaDB), so the server falls back on its default-configured character set instead. 5.7 has still been getting maintenance updates all this time (although it's soon to FINALLY be declared EOL and put out of its misery at long last);

Yes, as per your suggestion, using utf8mb4_unicode_520_ci works as expected for now.

I'm closing this issue, and I look forward to your updates. I appreciate your help @gwynne.