nodejs / node

`Intl.Segmenter` has different output in Node and Chrome #51563

Open AudunWA opened 9 months ago

AudunWA commented 9 months ago

Version

v18.19.0

Platform

Darwin MacBook-Pro-10.local 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct 9 21:28:45 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6020 arm64

Subsystem

No response

What steps will reproduce the bug?

Run the following code in Node.js:

const segmenter = new Intl.Segmenter("en-GB", { granularity: "word" });
console.log(
    [...segmenter.segment("This is a sentence.This is another.")]
        .filter((it) => it.isWordLike)
        .map((it) => it.segment)
);

How often does it reproduce? Is there a required condition?

The results are always the same.

What is the expected behavior? Why is that the expected behavior?

Running the same code in the console of Chrome 120.0.6099.234 results in

['This', 'is', 'a', 'sentence', 'This', 'is', 'another']

I expect the output to be the same in Node.js and Chrome.

What do you see instead?

Running this code in Node.js results in the output

[ 'This', 'is', 'a', 'sentence.This', 'is', 'another' ]
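
If the Chrome-style output is what you need regardless of runtime, one possible workaround is to post-process the word-like segments and split them at a full stop that sits directly between two letters. This is only a sketch: the helper name `wordsSplitOnInnerFullStop` and the regular expression are illustrative assumptions, not something Node.js or Chrome provides.

```javascript
// Hypothetical helper: return the word-like segments of `text`, additionally
// split at any full stop that sits directly between two letters
// (for example "sentence.This").
function wordsSplitOnInnerFullStop(text, locale = "en-GB") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Matches a "." that has a letter immediately before and after it.
  const innerFullStop = /(?<=\p{L})\.(?=\p{L})/u;
  return [...segmenter.segment(text)]
    .filter((it) => it.isWordLike)
    .flatMap((it) => it.segment.split(innerFullStop));
}

console.log(wordsSplitOnInnerFullStop("This is a sentence.This is another."));
```

Note that this also splits strings such as "example.com" or "U.S.A" wherever the segmenter returned them as a single word, so it only fits input where a full stop between letters is always a missing space.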

Additional information

No response

richardlau commented 9 months ago

cc @nodejs/i18n-api

Node.js 18.19.0 contains ICU 73.2 -- I'm not sure which ICU version Chrome 120 uses.
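
For anyone comparing setups, the ICU, Unicode and CLDR versions bundled with a given Node.js binary can be read from `process.versions`. A minimal sketch follows; the helper name `icuInfo` is made up for illustration, and the fields are reported as `null` here if Node.js was built without Intl support.

```javascript
// Hypothetical helper: report the ICU-related versions of the running Node.js binary.
function icuInfo() {
  const { icu, unicode, cldr } = process.versions;
  return {
    node: process.version,
    icu: icu ?? null,         // ICU library version
    unicode: unicode ?? null, // Unicode version supported by that ICU
    cldr: cldr ?? null,       // CLDR locale data version
  };
}

console.log(icuInfo());
```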

srl295 commented 9 months ago

ICU 74.1 behaves like Node.js, per https://icu4c-demos.unicode.org/icu-bin/icusegments#1/en, so I'm inclined to think Chrome is wrong here. I can confirm the Chrome behavior. My Chrome 120.0.6099.225 seems to have ICU 73.x.
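
For background, the default word boundary rules in Unicode UAX #29 (WB6 and WB7) do not break at a full stop that has a letter on both sides, which matches the `sentence.This` result. A small sketch for comparing runtimes on exactly that case is below; the helper name `wordLikeSegments` is hypothetical, and the output for the first sample is deliberately not shown because it is the part that differs between Node.js and Chrome.

```javascript
// Hypothetical helper for comparing runtimes: list the word-like segments of a string.
function wordLikeSegments(text, locale = "en-GB") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return [...segmenter.segment(text)]
    .filter((it) => it.isWordLike)
    .map((it) => it.segment);
}

// First sample: full stop directly between letters (the case in dispute).
// Second sample: full stop followed by a space.
for (const sample of ["sentence.This", "sentence. This"]) {
  console.log(JSON.stringify(sample), wordLikeSegments(sample));
}
```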

srl295 commented 9 months ago

Chrome uses customized ICU data. Maybe the segmenter data is scrambled.

V-yadav18 commented 9 months ago

@AudunWA you can use the unicode-segmentation library in Node.js, which provides a JavaScript implementation of Unicode segmentation algorithms.

srl295 commented 9 months ago

@V-yadav18 can you link to it here? Would be good to test that also.

srl295 commented 9 months ago

@V-yadav18 https://www.npmjs.com/package/unicode-segmentation is 404, can you put a link to the library you're referring to?

github-actions[bot] commented 2 months ago

This issue/PR was marked as stalled, it will be automatically closed in 30 days. If it should remain open, please leave a comment explaining why it should remain open.

RedYetiDev commented 2 months ago

@V-yadav18 npmjs.com/package/unicode-segmentation is 404, can you put a link to the library you're referring to?

The author still hasn't responded, so I've marked this issue as stalled. Feel free to undo.