jimmywarting opened 3 years ago
I think TextDecoder encoding support is limited to utf-8 while StringDecoder also supports utf-16, but maybe that's fine.
I think TextDecoder encoding support is limited to utf-8 while StringDecoder also supports utf-16, but maybe that's fine.
TextDecoder supports a lot of different encodings: https://nodejs.org/dist/latest-v16.x/docs/api/util.html#util_encodings_supported_by_default_with_full_icu_data
My bad, I mixed it up with TextEncoder! Please disregard my previous comment.
In a quick test, TextDecoder is an order of magnitude slower to decode a Uint8Array to a String than StringDecoder using Node 16. And I can't find anything that works faster than StringDecoder. Please don't deprecate StringDecoder until the performance is at least equivalent.
@mojavelinux would you mind sharing a performance test case? ... with test results?
We maybe won't go as far as deprecating it and eventually removing it. Maybe we will just mark it as legacy and note in the docs that it's best avoided when writing cross-compatible code.
Have you also considered that TextDecoder takes longer because it solves some problems StringDecoder does not handle? It may do something better that StringDecoder does not, like handling the BOM, for instance.
Also curious how the browserified version compares to the native TextDecoder. This is the main reason why I want to see less use of StringDecoder: to stop people from exporting node modules into browsers, increasing the bundle size too much, and to encourage better cross-compatible code.
string_decoder depends on Buffer, so depending on https://www.npmjs.com/package/string_decoder in the browser adds a whole bunch of bloat to your bundle, and the browser Buffer shim is not as fast as Node's Buffer.
const { StringDecoder } = require('string_decoder');
const data = new Uint8Array([239, 187, 191, 49]) // UTF-8 BOM followed by '1' (49)
// fails
JSON.parse(new StringDecoder('utf8').write(data))
JSON.parse(Buffer.from(data.buffer).toString())
// works
JSON.parse(new TextDecoder().decode(data))
new Response(data).json()
I believe fetch decodes data similarly to how TextDecoder does it, more correctly according to the spec... related: https://github.com/node-fetch/node-fetch/issues/541
A pretty basic test can demonstrate the order of magnitude change:
const iterations = 10000000
const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = new (require('util').TextDecoder)()
const SAMPLE = new Uint8Array([112,97,103,101,45,111,110,101,46,97,100,111,99])
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
const variation = process.argv[2] === 'A' ? variationA : variationB
console.time(variation.name)
for (let i = 0; i < iterations; i++) {
result = variation(SAMPLE)
}
console.timeEnd(variation.name)
On my machine, the results are as follows:
node perf.js A
//=> variationA: 940.728ms
node perf.js B
//=> variationB: 8.244s
In Chrome, I get:
variationB: 5491ms
It may do something better that StringDecoder does not, like handling the BOM, for instance
Perhaps I don't need that behavior ;) I also don't think BOM processing warrants an order-of-magnitude difference in execution time.
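For reference, whether TextDecoder strips the BOM is configurable via the ignoreBOM constructor option; a minimal sketch (the byte values are just the UTF-8 BOM followed by an ASCII '1'):

```javascript
// TextDecoder strips a leading UTF-8 BOM by default; passing
// { ignoreBOM: true } keeps it in the output instead.
const withBOM = new Uint8Array([0xEF, 0xBB, 0xBF, 0x31]); // BOM + '1'

const stripped = new TextDecoder().decode(withBOM);
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(withBOM);

console.log(stripped === '1');       // true
console.log(kept === '\uFEFF1');     // true
```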
In Chrome, I get: variationB: 5491ms
You're missing variationA in the browser results.
I don't know what the API in the browser is for StringDecoder.
I don't know what the API in the browser is for StringDecoder.
something like:
npx browserify file.js > out.js
// file.js
x = require('string_decoder')
I'm fine with removing the public use of StringDecoder. Perhaps under the covers in Node.js, TextDecoder can use the StringDecoder code when the encoding is utf-8 and ignoreBOM is true. I just don't want to have to bear such a tremendous drop in performance as it will affect the performance of my application (which calls this method hundreds of thousands of times).
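A userland sketch of that idea (decodeUtf8 is a hypothetical helper, not an existing Node.js API): for complete, non-streamed utf-8 chunks, Buffer#toString takes Node's optimized decoding path, so a wrapper could reserve TextDecoder for the streaming cases:

```javascript
// Hypothetical helper, not a Node.js API: decode a complete utf-8 chunk
// via Buffer#toString (fast, and it keeps any BOM, like StringDecoder).
// Only valid when the chunk does not end mid multibyte sequence.
function decodeUtf8(u8) {
  // Wrap the Uint8Array's memory without copying it.
  return Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength).toString('utf8');
}

console.log(decodeUtf8(new Uint8Array([104, 105]))); // hi
```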
npx browserify file.js > out.js
All that's doing is testing the implementation in Node.js in a browser. I wouldn't expect there to be any difference since it's the same code (and the same JavaScript engine). I thought you were comparing to a native implementation in the browser.
As I suspected, the performance is just as good in the browser for StringDecoder. In Chrome, I get:
variationA: 1389ms
What we see is that the performance of the native TextDecoder in Chrome is better than the TextDecoder in Node.js. But there is still a measurable difference when compared to StringDecoder.
A quick run with the profiler shows most of the time is spent in one function:
Actually, another run shows something different...
I modified the test a bit and ran it on some webpages with a realistically larger sample set
const SAMPLE = new TextEncoder().encode(document.body.innerText)
it shows variationA is much slower... (in Chrome, after being browserified)
const iterations = 1000
const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = (new TextDecoder())
const SAMPLE = new TextEncoder().encode(document.body.innerText)
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)
console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)
on nodejs.org's innerText:
variationA: 31.10400390625 ms
variationB: 1.6748046875 ms
Here is my browser bundle: https://pastebin.com/DDMKi3yH (for those who don't want to run browserify and want to test it out for themselves); the test is at the bottom. The fact that it requires ~2400 lines of code in the browser and runs much slower is exactly why I want to discourage its use.
EDIT:
here is a cdn solution:
const {StringDecoder} = await import('https://jspm.dev/string_decoder')
const iterations = 1000
const stringDecoder = new StringDecoder()
const textDecoder = new TextDecoder()
const SAMPLE = new TextEncoder().encode(document.body.innerText)
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)
console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)
I can confirm that TextDecoder does seem to perform better for larger sample sizes. Perhaps there is a problem with the implementation that makes it perform so poorly for small sample sizes. I don't know enough at this point to explain what's going on. These are certainly useful observations.
I did other runs (of https://github.com/nodejs/node/issues/39301#issuecomment-877623195) with the code from master instead of a release and here's what I get:
The 60% of "self time" in TextDecoder.decode actually comes from the call to the C++ function here: https://github.com/nodejs/node/blob/e2148d742549b7c88891bb50b4b4a562bf4bd46a/src/node_i18n.cc#L432
@mojavelinux:
Please don't deprecate StringDecoder...
Marking an API as legacy is not the same as deprecating it. APIs that are marked legacy are unlikely to ever change or be deprecated.
Thanks for pointing that out. Sorry for creating any confusion by mixing up the terminology.
node:string_decoder has a feature where it holds the bytes until a full Unicode character can be printed, which I don't think TextDecoder provides:
When a Buffer instance is written to the StringDecoder instance, an internal buffer is used to ensure that the decoded string does not contain any incomplete multibyte characters. These are held in the buffer until the next call to stringDecoder.write() or until stringDecoder.end() is called.
This feature is very convenient when printing outputs byte by byte, for example when implementing the streaming interface of LLM inference.
You can do that with TextDecoder as well:
> var d = new TextDecoder
undefined
> d.decode(Uint8Array.of(0xe2), { stream: true })
''
> d.decode(Uint8Array.of(0x82), { stream: true })
''
> d.decode(Uint8Array.of(0xAC), { stream: true })
'€'
📗 API Reference Docs Problem
https://nodejs.org/docs/latest-v16.x/api/string_decoder.html
Description
Like with util.inherits and querystring, can you also mark string_decoder as legacy and say something along the lines of: it would be best for cross-platform coding... Maybe even eventually deprecate it?