sidorares / node-mysql2

:zap: fast mysqljs/mysql compatible mysql driver for node.js
https://sidorares.github.io/node-mysql2/
MIT License
4.07k stars 618 forks source link

sphinxsearch and astral characters #507

Closed rlidwka closed 7 years ago

rlidwka commented 7 years ago

Workaround — solved using code from here:

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  typeCast: function (field, next) {
    if (field.type === 'STRING') {
      return field.buffer().toString('utf-8');
    }
    return next();
  }
})

I'm using mysql2 to connect to sphinx (that's a search engine that works over mysql 4.1 protocol, although sql syntax differs quite a bit). I was unable to reproduce following issue with a standard mysql server so far.

When I send a text there and get it back, astral characters (U+10000 and up, represented as surrogate pairs) gets replaced with 4 U+FFFD each.

I assume this is a bug in node-mysql2 because node-mysql works correctly in this exact case.

Source code:

//var mysql = require('mysql')
var mysql = require('mysql2')

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10
})

pool.getConnection(function (err, connection) {
  if (err) throw err

  connection.query(`CALL SNIPPETS(('test 😹 αβγ'), 'forum_posts', 'whatever')`,
    function (err, response) {
      if (err) throw err

      console.log(response)
    }
  )
})

Output with mysql module:

[ RowDataPacket { snippet: 'test 😹 αβγ' } ]

Output with mysql2 module:

[ TextRow { snippet: 'test ���� αβγ' } ]

Here's network traffic:

    00000000  43 00 00 00 0a 32 2e 33  2e 32 2d 69 64 36 34 2d C....2.3 .2-id64-
    00000010  62 65 74 61 20 28 3f 3f  3f 29 00 01 00 00 00 01 beta (?? ?)......
    00000020  02 03 04 05 06 07 08 00  08 82 21 02 00 00 00 00 ........ ..!.....
    00000030  00 00 00 00 00 00 00 00  00 00 01 02 03 04 05 06 ........ ........
    00000040  07 08 09 0a 0b 0c 0d                             .......
00000000  23 00 00 01 cf f3 82 00  00 00 00 00 e0 00 00 00 #....... ........
00000010  00 00 e0 01 00 00 00 00  90 37 25 03 00 00 00 00 ........ .7%.....
00000020  98 36 25 03 00 00 00                             .6%....
    00000047  07 00 00 02 00 00 00 00  00 00 00                ........ ...
00000027  3f 00 00 00 03 43 41 4c  4c 20 53 4e 49 50 50 45 ?....CAL L SNIPPE
00000037  54 53 28 28 27 74 65 73  74 20 f0 9f 98 b9 20 ce TS(('tes t .... .
00000047  b1 ce b2 ce b3 27 29 2c  20 27 66 6f 72 75 6d 5f .....'),  'forum_
00000057  70 6f 73 74 73 27 2c 20  27 77 68 61 74 65 76 65 posts',  'whateve
00000067  72 27 29                                         r')
    00000052  01 00 00 01 01 24 00 00  02 03 64 65 66 00 00 00 .....$.. ..def...
    00000062  07 73 6e 69 70 70 65 74  07 73 6e 69 70 70 65 74 .snippet .snippet
    00000072  0c 21 00 ff 00 00 00 fe  00 00 00 00 00 05 00 00 .!...... ........
    00000082  03 fe 00 00 00 00 11 00  00 04 10 74 65 73 74 20 ........ ...test 
    00000092  f0 9f 98 b9 20 ce b1 ce  b2 ce b3 05 00 00 05 fe .... ... ........
    000000A2  00 00 00 00                                      ....

Edit: added workaround on the top of the post

Edit 2: opened a bugreport against sphinx - http://sphinxsearch.com/bugs/view.php?id=2607

sidorares commented 7 years ago

The problem here is because sphinx reports that data is encoded using UTF8_GENERAL_CI, which is actually mysql name for CESU-8:

var Packets = require('./lib/packets/index.js');

var s = `
  01 00 00 01 01 24 00 00
  02 03 64 65 66 00 00 00
  07 73 6e 69 70 70 65 74
  07 73 6e 69 70 70 65 74
  0c 21 00 ff 00 00 00 fe
  00 00 00 00 00 05 00 00
  03 fe 00 00 00 00 11 00
  00 04 10 74 65 73 74 20
  f0 9f 98 b9 20 ce b1 ce
  b2 ce b3 05 00 00 05 fe
  00 00 00 00
`.split(/[ \n]+/).join('')
var b = Buffer.from(s, 'hex');

var conn = {
  config: {},
  serverEncoding: 0,
  clientEncoding: 0,
  _handshakePacket: {
    capabilityFlags: 0
  }
};

var p = new Packets.Packet(0, b, 0, b.length);
var header = new Packets.ResultSetHeader(p, conn);
console.log(header)

var p = new Packets.Packet(0, b, 5, b.length);
var col = new Packets.ColumnDefinition(p, conn.clientEncoding);
console.log(col)

first column is { catalog: 'def', schema: '', name: 'snippet', orgName: 'snippet', table: '', orgTable: '', characterSet: 33, columnLength: 255, columnType: 254, flags: 0, decimals: 0 }, and 33 is code for UTF8_GENERAL_CI

mysqljs/mysql is always using utf-8 for incoming data, where mysql2 decodes based on what server reports for column encoding

sidorares commented 7 years ago
var Packets = require('./lib/packets/index.js');

var s = `
  01 00 00 01 01 24 00 00
  02 03 64 65 66 00 00 00
  07 73 6e 69 70 70 65 74
  07 73 6e 69 70 70 65 74
  0c 21 00 ff 00 00 00 fe
  00 00 00 00 00 05 00 00
  03 fe 00 00 00 00 11 00
  00 04 10 74 65 73 74 20
  f0 9f 98 b9 20 ce b1 ce
  b2 ce b3 05 00 00 05 fe
  00 00 00 00
`.split(/[ \n]+/).join('')
var b = Buffer.from(s, 'hex');

var conn = {
  config: {},
  serverEncoding: 0,
  clientEncoding: 0,
  _handshakePacket: {
    capabilityFlags: 0
  }
};

var p = new Packets.Packet(0, b, 0, b.length);
var header = new Packets.ResultSetHeader(p, conn);
console.log(header)
var p = new Packets.Packet(0, b, 5, b.length);
var col = new Packets.ColumnDefinition(p, conn.clientEncoding);
console.log(col)
console.log(p.offset)

var CharsetToEncoding = require('./lib/constants/charset_encodings.js');
var compileParser = require('./lib/compile_text_parser.js');

var RowParser = compileParser([col], {}, {})

console.log(RowParser.toString())

var row = new RowParser(new Packets.Packet(0, b, 54, b.length), [col], {}, CharsetToEncoding)
console.log(row)

CharsetToEncoding[33] = 'utf8'
var row = new RowParser(new Packets.Packet(0, b, 54, b.length), [col], {}, CharsetToEncoding)
console.log(row)

outputs

TextRow { snippet: 'test ���� αβγ' }
TextRow { snippet: 'test 😹 αβγ' }
sidorares commented 7 years ago

@rlidwka I guess simple hackish (not very future proof) way to handle this for you might be this

var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js');
CharsetToEncoding[33] = 'utf8'

this would force mysql2 to decode fields with encoding 33 as utf8

Do you know if sphinx server respects connection time encoding flags? What are results if you connect like this:

var mysql = require('mysql2')

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  charset: 'UTF8MB4_GENERAL_CI'
})

///...
sidorares commented 7 years ago

@rlidwka do you know any simple docker image for spinx I can try locally without any extra setop to test this?

sidorares commented 7 years ago

also this might work - .query("set character_set_results 'utf8mb4'") ( or SET NAMES utf8mb4 )

puzrin commented 7 years ago

We do not use docker yet. Need latest 2.3.2-beta and build it from sources. But i think stable 2.2.11 should behave the same way. Packages are at http://sphinxsearch.com/downloads/release/.

puzrin commented 7 years ago

Do i understand right, that sphinx probably reports bad encoding for returned data packets (cesu-8 instead of utf-8)?

sidorares commented 7 years ago

Do i understand right, that sphinx probably reports bad encoding for returned data packets (cesu-8 instead of utf-8)?

I would assume that they use UTF8_GENERAL_CI as if it's utf8 and encode using [more modern] utf8 instead of cesu-8

sidorares commented 7 years ago

IMO worth reporting at shpinx

sidorares commented 7 years ago
> var i = require('iconv-lite')
undefined
> i.encode('😹 ', 'cesu-8')
<Buffer ed a0 bd ed b8 b9 20>
> i.decode(i.encode('😹 ', 'cesu-8'), 'cesu-8')
'😹 '
> i.encode('😹 ', 'utf8')
<Buffer f0 9f 98 b9 20>
> i.decode(i.encode('😹 ', 'cesu-8'), 'utf8')
'������ '
>
puzrin commented 7 years ago

IMO worth reporting at shpinx

Done http://sphinxsearch.com/bugs/view.php?id=2607. But, to be honest, they fix public reports veeery sloooow.

It's better to find the most simple workaround. Doing .query("set character_set_results 'utf8mb4'") after each connection is not cool. Option in createPool would be fine, if it helps.

PS. now we use temporary kludge - encode astrals as entities :)

sidorares commented 7 years ago

@puzrin can you confirm if connect time encoding setting fixes the issue?

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  charset: 'UTF8MB4_GENERAL_CI' /// <------------
})
puzrin commented 7 years ago

I'll ask @rlidwka to make tests today or tomorrow (need to finish urgent deals).

Digged docs for a while.

http://sphinxsearch.com/docs/devel.html#sphinxql-set

SET NAMES statement and SET @@variable_name syntax, both introduced in version 2.0.2-beta, do nothing. They were implemented to maintain compatibility with 3rd party MySQL client libraries, connectors, and frameworks that may need to run this statement when connecting.

CHARACTER_SET_RESULTS = charset_name Does nothing; a placeholder to support frameworks, clients, and connectors that attempt to automatically enforce a charset when connecting to a Sphinx server. Introduced in version 2.0.1-beta.

Also from changelog:

Removed charset_type and mssql_unicode - we now support only UTF-8 encoding.

sidorares commented 7 years ago

looks like the only short term option is to use something like

var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js');
CharsetToEncoding[33] = 'utf8'

long term - (1) wait for them to update result encoding to be UTF8MB4_GENERAL_CI ( 45 ) or (2) mysql2 to start maintaining table of exceptions to CharsetToEncoding based on reported server name/version

puzrin commented 7 years ago

We use mysql2 with real mysql server too. It's dangerous to do global overrides.

Is it real to add pool option to skip incoming data recode as mysql does? I understand this is a hack, but looks more attractive than waiting for sphinx and waiting for exceptions support.

sidorares commented 7 years ago

@puzrin you can use type casting functionality ( see docs at https://github.com/mysqljs/mysql#type-casting )

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  typeCast: function (field, next) {
    if (field.type === 'STRING') {
      return field.buffer().toString('utf-8');
    }
    return next();
  }
})
rlidwka commented 7 years ago

@sidorares , thanks for answering. I've been trying those options:

var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js'); CharsetToEncoding[33] = 'utf8'

This works.

you can use type casting functionality

typeCast also works.

Do you know if sphinx server respects connection time encoding flags? What are results if you connect like this: charset: 'UTF8MB4_GENERAL_CI'

Doesn't work, server's output is the same.

also this might work - .query("set character_set_results 'utf8mb4'") ( or SET NAMES utf8mb4 )

Doesn't work (tried set character_set_results=utf8mb4 and SET NAMES utf8mb4).

@rlidwka do you know any simple docker image for spinx I can try locally without any extra setop to test this?

Nope, sorry. I've never actually used docker.

sidorares commented 7 years ago

looks like we have working solution and it's mostly sphinx, closing this

puzrin commented 7 years ago

@sidorares thanks for your help! Final workaround info was added to first post. May be that will simplify life for someone else.