Closed: rlidwka closed this issue 7 years ago
The problem here is that Sphinx reports the data as encoded with `UTF8_GENERAL_CI`, which is actually MySQL's name for CESU-8:
```js
var Packets = require('./lib/packets/index.js');

var s = `
01 00 00 01 01 24 00 00
02 03 64 65 66 00 00 00
07 73 6e 69 70 70 65 74
07 73 6e 69 70 70 65 74
0c 21 00 ff 00 00 00 fe
00 00 00 00 00 05 00 00
03 fe 00 00 00 00 11 00
00 04 10 74 65 73 74 20
f0 9f 98 b9 20 ce b1 ce
b2 ce b3 05 00 00 05 fe
00 00 00 00
`.split(/[ \n]+/).join('');
var b = Buffer.from(s, 'hex');

var conn = {
  config: {},
  serverEncoding: 0,
  clientEncoding: 0,
  _handshakePacket: {
    capabilityFlags: 0
  }
};

var p = new Packets.Packet(0, b, 0, b.length);
var header = new Packets.ResultSetHeader(p, conn);
console.log(header);

p = new Packets.Packet(0, b, 5, b.length);
var col = new Packets.ColumnDefinition(p, conn.clientEncoding);
console.log(col);
```
The first column is `{ catalog: 'def', schema: '', name: 'snippet', orgName: 'snippet', table: '', orgTable: '', characterSet: 33, columnLength: 255, columnType: 254, flags: 0, decimals: 0 }`, and 33 is the code for `UTF8_GENERAL_CI`.
mysqljs/mysql always uses UTF-8 for incoming data, whereas mysql2 decodes based on the encoding the server reports for each column.
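To make the mismatch concrete, here is a minimal sketch (not mysql2 code) of what CESU-8 does with an astral character: it UTF-8-encodes each UTF-16 surrogate half separately, so one 4-byte UTF-8 sequence becomes two 3-byte sequences.

```js
// '😹' (U+1F639) is one code point, but two UTF-16 code units (a surrogate pair).
// Real UTF-8 encodes the code point as 4 bytes; CESU-8 encodes each surrogate
// half as its own 3-byte sequence.
function cesu8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const cu = str.charCodeAt(i); // raw UTF-16 code unit, surrogates included
    if (cu < 0x80) bytes.push(cu);
    else if (cu < 0x800) bytes.push(0xc0 | (cu >> 6), 0x80 | (cu & 0x3f));
    else bytes.push(0xe0 | (cu >> 12), 0x80 | ((cu >> 6) & 0x3f), 0x80 | (cu & 0x3f));
  }
  return Buffer.from(bytes);
}

console.log(Buffer.from('😹', 'utf8')); // <Buffer f0 9f 98 b9>  (what is actually on the wire)
console.log(cesu8('😹'));               // <Buffer ed a0 bd ed b8 b9>  (what charset 33 promises)
```

Because mysql2 trusts the charset 33 label and decodes with a CESU-8 decoder, the genuine 4-byte UTF-8 sequence gets rejected byte by byte.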
The full experiment, parsing the actual row with the reported encoding versus a forced `utf8`:

```js
var Packets = require('./lib/packets/index.js');

var s = `
01 00 00 01 01 24 00 00
02 03 64 65 66 00 00 00
07 73 6e 69 70 70 65 74
07 73 6e 69 70 70 65 74
0c 21 00 ff 00 00 00 fe
00 00 00 00 00 05 00 00
03 fe 00 00 00 00 11 00
00 04 10 74 65 73 74 20
f0 9f 98 b9 20 ce b1 ce
b2 ce b3 05 00 00 05 fe
00 00 00 00
`.split(/[ \n]+/).join('');
var b = Buffer.from(s, 'hex');

var conn = {
  config: {},
  serverEncoding: 0,
  clientEncoding: 0,
  _handshakePacket: {
    capabilityFlags: 0
  }
};

var p = new Packets.Packet(0, b, 0, b.length);
var header = new Packets.ResultSetHeader(p, conn);
console.log(header);

p = new Packets.Packet(0, b, 5, b.length);
var col = new Packets.ColumnDefinition(p, conn.clientEncoding);
console.log(col);
console.log(p.offset);

var CharsetToEncoding = require('./lib/constants/charset_encodings.js');
var compileParser = require('./lib/compile_text_parser.js');
var RowParser = compileParser([col], {}, {});
console.log(RowParser.toString());

var row = new RowParser(new Packets.Packet(0, b, 54, b.length), [col], {}, CharsetToEncoding);
console.log(row);

// force charset 33 to decode as utf8 instead of cesu8
CharsetToEncoding[33] = 'utf8';
row = new RowParser(new Packets.Packet(0, b, 54, b.length), [col], {}, CharsetToEncoding);
console.log(row);
```
outputs

```
TextRow { snippet: 'test ���� αβγ' }
TextRow { snippet: 'test 😹 αβγ' }
```
@rlidwka I guess a simple, hackish (not very future-proof) way to handle this for you might be:

```js
var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js');
CharsetToEncoding[33] = 'utf8';
```

This forces mysql2 to decode fields with charset 33 as UTF-8.
Do you know if the Sphinx server respects connection-time encoding flags? What are the results if you connect like this:

```js
var mysql = require('mysql2');

var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  charset: 'UTF8MB4_GENERAL_CI'
});
// ...
```
@rlidwka do you know any simple docker image for Sphinx I can try locally, without any extra setup, to test this?

Also, this might work: `.query("set character_set_results 'utf8mb4'")` (or `SET NAMES utf8mb4`).
We do not use docker yet. You need the latest 2.3.2-beta and have to build it from sources, but I think stable 2.2.11 should behave the same way. Packages are at http://sphinxsearch.com/downloads/release/.

Do I understand right that Sphinx probably reports the wrong encoding for returned data packets (CESU-8 instead of UTF-8)?
> Do i understand right, that sphinx probably reports bad encoding for returned data packets (cesu-8 instead of utf-8)?

I would assume that they use the `UTF8_GENERAL_CI` label as if it were utf8, and encode using the [more modern] UTF-8 instead of CESU-8.
IMO worth reporting to Sphinx:
```
> var i = require('iconv-lite')
undefined
> i.encode('😹 ', 'cesu-8')
<Buffer ed a0 bd ed b8 b9 20>
> i.decode(i.encode('😹 ', 'cesu-8'), 'cesu-8')
'😹 '
> i.encode('😹 ', 'utf8')
<Buffer f0 9f 98 b9 20>
> i.decode(i.encode('😹 ', 'cesu-8'), 'utf8')
'������ '
>
```
> IMO worth reporting to Sphinx

Done: http://sphinxsearch.com/bugs/view.php?id=2607. But, to be honest, they fix public reports veeery sloooow.
It's better to find the simplest workaround. Doing `.query("set character_set_results 'utf8mb4'")` after each connection is not cool. An option in `createPool` would be fine, if it helps.

PS: for now we use a temporary kludge and encode astrals as entities :)
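For reference, such a kludge might look roughly like this (a sketch of the idea only, not their actual code; `encodeAstrals` is a hypothetical helper name): replace astral characters with numeric character references before sending, so only BMP characters ever cross the wire.

```js
// Replace every astral character (U+10000 and up) with a numeric entity.
function encodeAstrals(str) {
  return str.replace(/[\u{10000}-\u{10FFFF}]/gu, ch => `&#${ch.codePointAt(0)};`);
}

console.log(encodeAstrals('test 😹 αβγ')); // 'test &#128569; αβγ'
```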
@puzrin can you confirm whether the connection-time encoding setting fixes the issue?

```js
var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  charset: 'UTF8MB4_GENERAL_CI' // <------------
});
```
I'll ask @rlidwka to run the tests today or tomorrow (we need to finish some urgent work first).
Dug through the docs for a while: http://sphinxsearch.com/docs/devel.html#sphinxql-set

> SET NAMES statement and SET @@variable_name syntax, both introduced in version 2.0.2-beta, do nothing. They were implemented to maintain compatibility with 3rd party MySQL client libraries, connectors, and frameworks that may need to run this statement when connecting.

> CHARACTER_SET_RESULTS = charset_name Does nothing; a placeholder to support frameworks, clients, and connectors that attempt to automatically enforce a charset when connecting to a Sphinx server. Introduced in version 2.0.1-beta.

Also from the changelog:

> Removed charset_type and mssql_unicode - we now support only UTF-8 encoding.
Looks like the only short-term option is to use something like:

```js
var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js');
CharsetToEncoding[33] = 'utf8';
```

Long term: (1) wait for them to update the result encoding to `UTF8MB4_GENERAL_CI` (45), or (2) have mysql2 maintain a table of exceptions to `CharsetToEncoding` based on the reported server name/version.
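Option (2) could look roughly like this. This is a purely hypothetical sketch: the `encodingFor` helper, the version-string check, and the override table are my assumptions, not mysql2 API.

```js
// Pick a decoding for a charset number, with per-server exceptions applied to a
// copy of the base map so nothing global is mutated.
function encodingFor(charsetNr, serverVersion, baseMap) {
  const map = baseMap.slice(); // per-connection copy
  if (/sphinx/i.test(serverVersion)) {
    map[33] = 'utf8'; // Sphinx sends real UTF-8 despite reporting charset 33
  }
  return map[charsetNr];
}

const base = [];
base[33] = 'cesu8';
console.log(encodingFor(33, 'Sphinx 2.2.11', base)); // 'utf8'
console.log(encodingFor(33, '5.7.19-log', base));    // 'cesu8'
```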
We use mysql2 with a real MySQL server too, so doing global overrides is dangerous. Would it be possible to add a pool option that skips the incoming data recode, as `mysql` does? I understand this is a hack, but it looks more attractive than waiting for Sphinx and waiting for exceptions support.
@puzrin you can use the type-casting functionality (see docs at https://github.com/mysqljs/mysql#type-casting):

```js
var pool = mysql.createPool({
  host: '10.0.3.77',
  port: 9306,
  connectionLimit: 10,
  typeCast: function (field, next) {
    if (field.type === 'STRING') {
      return field.buffer().toString('utf-8');
    }
    return next();
  }
});
```
@sidorares, thanks for answering. I've tried those options:

> var CharsetToEncoding = require('mysql2/lib/constants/charset_encodings.js'); CharsetToEncoding[33] = 'utf8'

This works.

> you can use type casting functionality

`typeCast` also works.

> Do you know if sphinx server respects connection time encoding flags? What are results if you connect like this: charset: 'UTF8MB4_GENERAL_CI'

Doesn't work, the server's output is the same.

> also this might work - .query("set character_set_results 'utf8mb4'") ( or SET NAMES utf8mb4 )

Doesn't work (tried both `set character_set_results=utf8mb4` and `SET NAMES utf8mb4`).

> @rlidwka do you know any simple docker image for sphinx I can try locally without any extra setup to test this?

Nope, sorry. I've never actually used docker.
Looks like we have a working solution and the root cause is mostly on the Sphinx side; closing this.
@sidorares thanks for your help! Final workaround info was added to the first post. Maybe that will simplify life for someone else.
Workaround — solved using code from here:
I'm using mysql2 to connect to Sphinx (a search engine that speaks the MySQL 4.1 protocol, although its SQL syntax differs quite a bit). I was unable to reproduce the following issue with a standard MySQL server so far.

When I send text there and get it back, astral characters (U+10000 and up, represented as surrogate pairs) get replaced with 4 U+FFFD each.
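Why exactly four U+FFFD per character: anything at U+10000 and above takes a 4-byte sequence in UTF-8, and a decoder that does not accept 4-byte sequences (such as a strict CESU-8 decoder) emits one replacement character per rejected byte. A quick check in plain Node:

```js
const buf = Buffer.from('😹', 'utf8'); // U+1F639
console.log(buf.length);      // 4 bytes on the wire
console.log('😹'.length);     // 2 UTF-16 code units (a surrogate pair)
console.log('\ufffd'.repeat(buf.length)); // the '����' seen in the broken output
```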
I assume this is a bug in node-mysql2, because node-mysql works correctly in this exact case.

Source code:

Output with `mysql` module:

Output with `mysql2` module:

Here's the network traffic:
Edit: added the workaround at the top of the post.
Edit 2: opened a bug report against sphinx - http://sphinxsearch.com/bugs/view.php?id=2607