pinojs / sonic-boom

Extremely fast utf8 only stream implementation
MIT License
266 stars 41 forks source link

Problem about message missing when write CJK string #139

Closed hicdre closed 10 months ago

hicdre commented 2 years ago
const SonicBoom = require('sonic-boom');
const sonic = new SonicBoom({ fd: process.stdout.fd }); 

sonic.write('测试一二三四五六七八九十\n');
sonic.write('零\n');  // this is missing in stdout

console.log('finish');

The problem is about line below https://github.com/pinojs/sonic-boom/blob/7b4e25f26d77d875e5224aae7454c1b48455296f/index.js#L164-L165 n is the byte length of this._writingBuf, but it may not equals the characters length of the string which actually encoded in utf16. So this._len would be less than zero. It will lead to message missing.

console.assert('测试一二三四五六七八九十\n'.length === 13);
console.assert(Buffer.from('测试一二三四五六七八九十\n', 'utf8').length === 37);
ronag commented 2 years ago

Good catch! @mcollina I don't see a good way to fix this without some form of perf regression. I think we have 2 options:

In write we:

  1. Buffer.byteLength(str)
  2. Always convert to Buffer in write.
jsumners commented 2 years ago

This module's readme clearly states it is a utf8 only stream module.

ronag commented 2 years ago

All strings in javascript are utf-16...

jsumners commented 2 years ago

All strings in javascript are utf-16...

So they are https://262.ecma-international.org/11.0/#sec-ecmascript-language-types-string-type

hicdre commented 2 years ago

pinojs/pino-pretty#324

zbinlin commented 2 years ago

Can you fix this issue? @mcollina @ronag @jsumners I use Fastify v4, it will cause the log not to be flushed.

mcollina commented 2 years ago

That's unlikely to happen in the short term. I'd be happy to see a PR that fixes it (without regressing the throughput).

mcollina commented 2 years ago

Here is a PR with a partial fix: https://github.com/pinojs/sonic-boom/pull/154

mcollina commented 2 years ago

The next step would be to convert this._writingBuf from being a String to being a Buffer.

jsumners commented 2 years ago

For whomever tackles this by swapping out the String backing to a Buffer, keep in mind that there are multiple types of multi-byte utf-8 characters. See https://github.com/ldapjs/filter/blob/1423612281ccd43fb5f45dc9b534d91969d3cc6e/lib/utils/escape-filter-value.js for some code that covers 2 and 3 byte characters.