Closed unilynx closed 2 weeks ago
I managed to reproduce using your code snippet, and the slightly modified one below (for readability)
let i = 0;
while (i < 1_000_000) {
const asHex = Buffer.from("\x80").toString("hex");
if (asHex === '80') {
break;
} else if (asHex !== 'c280') {
console.log("Unexpected return value:", asHex);
}
i++;
}
if (i < 1_000_000) {
console.log("FAILED after %d iterations", i);
process.exit(1);
} else {
console.log("PASSED after %d iterations", i);
}
$ node repro.js
FAILED after 8827 iterations
$ node repro.js
FAILED after 5509 iterations
$ node repro.js
FAILED after 48741 iterations
$ node repro.js
FAILED after 7487 iterations
$ node repro.js
FAILED after 9189 iterations
bisecting gives me, as first bad commit:
commit c9dabe2927374e15d970814af2f758d05c2f977c Author: Robert Nagy ronagy@icloud.com Date: Thu Aug 15 03:01:05 2024 +0200
buffer: use fast API for writing one-byte strings
however, that one's even worse and shows a lot of incorrect 'asHex' values:
Unexpected return value afdead0b
Unexpected return value 00dead0b
Unexpected return value 00000000
Unexpected return value 00dead0b
Unexpected return value 01dead0b
Unexpected return value 00000000
Unexpected return value 905e0150
Unexpected return value 00738148
so it looks like the more extreme bug got fixed (probably in 78008935b0d66fe61f7b507c6b529aac097c8560) but then issues still remained for (eg) \x80
CC @ronag
just to confirm the above - if I take the v22.x branch (5ae0260837ce72e97caca75568f6edabd414d905) but revert 78008935b0d66fe61f7b507c6b529aac097c8560 and a5a60e68232ae26615ff2c297bea49314354c42c, the issue is no longer present
I'll have a look.
let i = 0;
while (i < 1_000_000) {
const buf = Buffer.from("\x80")
if (buf[0] !== 194 || buf[1] !== 128) {
console.log("Unexpected return value:", buf, buf[0], buf[1]);
break
}
i++;
}
if (i < 1_000_000) {
console.log("FAILED after %d iterations", i);
process.exit(1);
} else {
console.log("PASSED after %d iterations", i);
}
This seems to be something in V8.
In
uint32_t FastWriteString(Local<Value> receiver,
const v8::FastApiTypedArray<uint8_t>& dst,
const v8::FastOneByteString& src,
uint32_t offset,
uint32_t max_length) {
}
src.length === 1
i.e. FastOneByteString
thinks that Buffer.from(['\x80'])
has a length of 1.
@targos @joyeecheung @nodejs/buffer
Which is kind of correct... is it the slow path that is broken?
actually @anonrig @lemire might have ideas?
Is the problem here that we are not handling incomplete utf8 sequences?
I'm not sure where the c2
from the expected c280
is coming from?
If 0x80 was a valid Unicode position, 0xC2 0x80 woud be its UTF8 Encoding. At the very least, this is a regression compared to earlier node (or V8?) versions.
I'm not sure about correctness or what the spec says - JavaScript has always kind of tolerated invalid unicode in its string (otherwise Buffer.from(xxx, "latin1") wouldn't be useful either). I think https://github.com/nodejs/node/issues/54518#issuecomment-2306833198 is coming from the same direction (I'm also seeing issues other Invalid UTF-8 encoding issues pop up - 0x80 was just the cleanest example of the regression)
Anyway, if we ignore 0x80 for now... 0xA0 (non breaking space)is certainly a valid Unicode character and perhaps the most often occurring Unicode codepoint above 128. It should encode to 0xC2 0xA0 in utf8
let i = 0;
let prev = '';
for(; i < 1_000_000; i++) {
const ashex = Buffer.from("\xA0").toString("hex");
if(prev !== ashex) {
console.log("Got ", ashex);
prev = ashex;
}
}
returns
Got c2a0
Got a0
The encoding here also changes for the fast path
node 20 only returns c2a0
Looking at the slow implementation there is a REPLACE_INVALID_UTF8
flag which I guess is what is causing the difference.
@lemire does simdutf have a "memcpy" + replace invalid utf8 routine?
@ronag
does simdutf have a "memcpy" + replace invalid utf8 routine?
It does not. The simdutf library does not do replacement. It only transcodes valid inputs.
@ronag
In simdutf, the expectation (currently) is that you assume that the input is valid. If it is not, then simdutf will tell you, and you can use a slow path. We expect most inputs to be valid strings...
so is there a way I could do:
if (!simd_utf8_copy(src, dst, size)) {
// slow path
}
so is there a way I could do:
Let me have a look at the offending code.
@ronag Let me create a PR (for discussion).
@ronag I see you were faster (https://github.com/nodejs/node/pull/54526), so let me go to https://github.com/nodejs/node/pull/54526 instead of me creating a PR.
Fixed by #54565
Version
v22.7.0
Platform
Subsystem
Buffer
What steps will reproduce the bug?
How often does it reproduce? Is there a required condition?
For me, it will consistently fail somewhere between 7000 to 1200 iterations.
What is the expected behavior? Why is that the expected behavior?
In node 20 this code does not fail , even after 1_000_000 iterations.
Buffer.from("\x80").toString("hex")
always returnsc280
on node 20What do you see instead?
Buffer.from("\x80").toString("hex")
incorrectly returns80
after sufficient iterationsAdditional information
No response