nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.52k stars 29.03k forks source link

Buffer.from() output breaks after optimizing in nodejs 22.7 #54521

Closed unilynx closed 2 weeks ago

unilynx commented 3 weeks ago

Version

v22.7.0

Platform

Darwin moe.fritz.box 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:13:18 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6030 arm64

Subsystem

Buffer

What steps will reproduce the bug?

let i = 0;
for(; i < 1_000_000; i++) {
  const ashex = Buffer.from("\x80").toString("hex");
  if (ashex === '80')
    break;
 else if(ashex !== 'c280')
    console.log("Unexpected return value", ashex); //this never happened for me
}

if(i<1_000_000) {
  console.log("FAILED after %d iterations",i);
  process.exit(1);
} else
  console.log("PASSED after %d iterations",i);

How often does it reproduce? Is there a required condition?

For me, it will consistently fail somewhere between 7000 to 1200 iterations.

What is the expected behavior? Why is that the expected behavior?

In node 20 this code does not fail , even after 1_000_000 iterations. Buffer.from("\x80").toString("hex") always returns c280 on node 20

What do you see instead?

Buffer.from("\x80").toString("hex") incorrectly returns 80 after sufficient iterations

Additional information

No response

RedYetiDev commented 3 weeks ago

I managed to reproduce using your code snippet, and the slightly modified one below (for readability)

let i = 0;

while (i < 1_000_000) {
  const asHex = Buffer.from("\x80").toString("hex");

  if (asHex === '80') {
    break;
  } else if (asHex !== 'c280') {
    console.log("Unexpected return value:", asHex);
  }

  i++;
}

if (i < 1_000_000) {
  console.log("FAILED after %d iterations", i);
  process.exit(1);
} else {
  console.log("PASSED after %d iterations", i);
}
$ node repro.js
FAILED after 8827 iterations
$ node repro.js
FAILED after 5509 iterations
$ node repro.js
FAILED after 48741 iterations
$ node repro.js
FAILED after 7487 iterations
$ node repro.js
FAILED after 9189 iterations
unilynx commented 2 weeks ago

bisecting gives me, as first bad commit:

commit c9dabe2927374e15d970814af2f758d05c2f977c Author: Robert Nagy ronagy@icloud.com Date: Thu Aug 15 03:01:05 2024 +0200

buffer: use fast API for writing one-byte strings

however, that one's even worse and shows a lot of incorrect 'asHex' values:

Unexpected return value afdead0b
Unexpected return value 00dead0b
Unexpected return value 00000000
Unexpected return value 00dead0b
Unexpected return value 01dead0b
Unexpected return value 00000000
Unexpected return value 905e0150
Unexpected return value 00738148

so it looks like the more extreme bug got fixed (probably in 78008935b0d66fe61f7b507c6b529aac097c8560) but then issues still remained for (eg) \x80

RedYetiDev commented 2 weeks ago

CC @ronag

unilynx commented 2 weeks ago

just to confirm the above - if I take the v22.x branch (5ae0260837ce72e97caca75568f6edabd414d905) but revert 78008935b0d66fe61f7b507c6b529aac097c8560 and a5a60e68232ae26615ff2c297bea49314354c42c, the issue is no longer present

ronag commented 2 weeks ago

I'll have a look.

ronag commented 2 weeks ago
let i = 0;

while (i < 1_000_000) {
  const buf = Buffer.from("\x80")

  if (buf[0] !== 194 || buf[1] !== 128) {
    console.log("Unexpected return value:", buf, buf[0], buf[1]);
    break
  }

  i++;
}

if (i < 1_000_000) {
  console.log("FAILED after %d iterations", i);
  process.exit(1);
} else {
  console.log("PASSED after %d iterations", i);
}
ronag commented 2 weeks ago

This seems to be something in V8.

In

uint32_t FastWriteString(Local<Value> receiver,
                         const v8::FastApiTypedArray<uint8_t>& dst,
                         const v8::FastOneByteString& src,
                         uint32_t offset,
                         uint32_t max_length) {
}

src.length === 1 i.e. FastOneByteString thinks that Buffer.from(['\x80']) has a length of 1.

@targos @joyeecheung @nodejs/buffer

ronag commented 2 weeks ago

Which is kind of correct... is it the slow path that is broken?

ronag commented 2 weeks ago

actually @anonrig @lemire might have ideas?

ronag commented 2 weeks ago

Is the problem here that we are not handling incomplete utf8 sequences?

ronag commented 2 weeks ago

I'm not sure where the c2 from the expected c280 is coming from?

unilynx commented 2 weeks ago

If 0x80 was a valid Unicode position, 0xC2 0x80 woud be its UTF8 Encoding. At the very least, this is a regression compared to earlier node (or V8?) versions.

I'm not sure about correctness or what the spec says - JavaScript has always kind of tolerated invalid unicode in its string (otherwise Buffer.from(xxx, "latin1") wouldn't be useful either). I think https://github.com/nodejs/node/issues/54518#issuecomment-2306833198 is coming from the same direction (I'm also seeing issues other Invalid UTF-8 encoding issues pop up - 0x80 was just the cleanest example of the regression)

Anyway, if we ignore 0x80 for now... 0xA0 (non breaking space)is certainly a valid Unicode character and perhaps the most often occurring Unicode codepoint above 128. It should encode to 0xC2 0xA0 in utf8

let i = 0;
let prev = '';
for(; i < 1_000_000; i++) {
  const ashex = Buffer.from("\xA0").toString("hex");
  if(prev !== ashex) {
    console.log("Got ", ashex);
    prev = ashex;
  }
}

returns

Got  c2a0
Got  a0

The encoding here also changes for the fast path

node 20 only returns c2a0

ronag commented 2 weeks ago

Looking at the slow implementation there is a REPLACE_INVALID_UTF8 flag which I guess is what is causing the difference.

@lemire does simdutf have a "memcpy" + replace invalid utf8 routine?

lemire commented 2 weeks ago

@ronag

does simdutf have a "memcpy" + replace invalid utf8 routine?

It does not. The simdutf library does not do replacement. It only transcodes valid inputs.

lemire commented 2 weeks ago

@ronag

In simdutf, the expectation (currently) is that you assume that the input is valid. If it is not, then simdutf will tell you, and you can use a slow path. We expect most inputs to be valid strings...

ronag commented 2 weeks ago

so is there a way I could do:

if (!simd_utf8_copy(src, dst, size)) {
  // slow path
}
lemire commented 2 weeks ago

so is there a way I could do:

Let me have a look at the offending code.

lemire commented 2 weeks ago

@ronag Let me create a PR (for discussion).

lemire commented 2 weeks ago

@ronag I see you were faster (https://github.com/nodejs/node/pull/54526), so let me go to https://github.com/nodejs/node/pull/54526 instead of me creating a PR.

RedYetiDev commented 2 weeks ago

Fixed by #54565