jimmywarting opened 3 years ago
I think TextDecoder encoding support is limited to utf-8 while StringDecoder also supports utf-16, but maybe that's fine.
I think TextDecoder encoding support is limited to utf-8 while StringDecoder also supports utf-16, but maybe that's fine.
TextDecoder supports a lot of different encodings: https://nodejs.org/dist/latest-v16.x/docs/api/util.html#util_encodings_supported_by_default_with_full_icu_data
My bad, I mixed it up with TextEncoder! Please disregard my previous comment.
In a quick test, TextDecoder is an order of magnitude slower to decode a Uint8Array to a String than StringDecoder using Node 16. And I can't find anything that works faster than StringDecoder. Please don't deprecate StringDecoder until the performance is at least equivalent.
@mojavelinux would you mind sharing a performance test case? ... with test results?
We maybe won't go as far as deprecating it and eventually removing it. Maybe we will just mark it as legacy and note in the docs that it's best avoided when writing cross-compatible code.
Have you also considered that TextDecoder takes longer because it solves some problems StringDecoder does not handle? It may do something better that StringDecoder does not, like handling the BOM, for instance.
Also curious how the browserified version compares to the native TextDecoder. This is the main reason why I want to see less use of StringDecoder: to stop people from exporting node modules into browsers, increasing the bundle size too much, and to encourage better cross-compatible code.
string_decoder depends on Buffer, so depending on https://www.npmjs.com/package/string_decoder in the browser adds a whole bunch of bloat to your bundle, and the browser Buffer shim is not as fast as Node's Buffer.
const { StringDecoder } = require('string_decoder');
const data = new Uint8Array([239, 187, 191, 49]) // UTF-8 BOM followed by '1' (49)
// fails
JSON.parse(new StringDecoder('utf8').write(data))
JSON.parse(Buffer.from(data.buffer).toString())
// works
JSON.parse(new TextDecoder().decode(data))
new Response(data).json()
I believe fetch decodes data similarly to how TextDecoder does it, more correctly according to the spec... related: https://github.com/node-fetch/node-fetch/issues/541
A pretty basic test can demonstrate the order of magnitude change:
const iterations = 10000000
const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = new (require('util').TextDecoder)()
const SAMPLE = new Uint8Array([112,97,103,101,45,111,110,101,46,97,100,111,99])
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
const variation = process.argv[2] === 'A' ? variationA : variationB
console.time(variation.name)
for (let i = 0; i < iterations; i++) {
result = variation(SAMPLE)
}
console.timeEnd(variation.name)
On my machine, the results are as follows:
node perf.js A
//=> variationA: 940.728ms
node perf.js B
//=> variationB: 8.244s
In Chrome, I get:
variationB: 5491ms
It may do something better that StringDecoder does not, like handling the BOM, for instance
Perhaps I don't need that behavior ;) I also don't think BOM processing warrants an order-of-magnitude difference in execution time.
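For reference, whether TextDecoder strips the BOM is configurable via the ignoreBOM constructor option; a minimal sketch (the byte values are just the UTF-8 BOM followed by an ASCII '1'):

```javascript
// TextDecoder strips a leading UTF-8 BOM by default; passing
// { ignoreBOM: true } keeps it in the output instead.
const withBOM = new Uint8Array([0xEF, 0xBB, 0xBF, 0x31]); // BOM + '1'

const stripped = new TextDecoder().decode(withBOM);
const kept = new TextDecoder('utf-8', { ignoreBOM: true }).decode(withBOM);

console.log(stripped === '1');       // true
console.log(kept === '\uFEFF1');     // true
```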
In Chrome, I get: variationB: 5491ms
You're missing variationA in the browser results.
I don't know what the API in the browser is for StringDecoder.
I don't know what the API in the browser is for StringDecoder.
something like:
npx browserify file.js > out.js
// file.js
x = require('string_decoder')
I'm fine with removing the public use of StringDecoder. Perhaps under the covers in Node.js, TextDecoder can use the StringDecoder code when the encoding is utf-8 and ignoreBOM is true. I just don't want to have to bear such a tremendous drop in performance as it will affect the performance of my application (which calls this method hundreds of thousands of times).
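A userland sketch of that idea (decodeUtf8 is a hypothetical helper, not an existing Node.js API): for complete, non-streamed utf-8 chunks, Buffer#toString takes Node's optimized decoding path, so a wrapper could reserve TextDecoder for the streaming cases:

```javascript
// Hypothetical helper, not a Node.js API: decode a complete utf-8 chunk
// via Buffer#toString (fast, and it keeps any BOM, like StringDecoder).
// Only valid when the chunk does not end mid multibyte sequence.
function decodeUtf8(u8) {
  // Wrap the Uint8Array's memory without copying it.
  return Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength).toString('utf8');
}

console.log(decodeUtf8(new Uint8Array([104, 105]))); // hi
```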
npx browserify file.js > out.js
All that's doing is testing the implementation in Node.js in a browser. I wouldn't expect there to be any difference since it's the same code (and the same JavaScript engine). I thought you were comparing to a native implementation in the browser.
As I suspected, the performance is just as good in the browser for StringDecoder. In Chrome, I get:
variationA: 1389ms
What we see is that the performance of the native TextDecoder in Chrome is better than the TextDecoder in Node.js. But there is still a measurable difference when compared to StringDecoder.
A quick run with the profiler shows most of the time is spent in one function:
Actually, another run shows something different...
I modified the test a bit and ran it on some webpages with a realistically larger sample set
const SAMPLE = new TextEncoder().encode(document.body.innerText)
it shows variationA is much slower... (in Chrome, after being browserified)
const iterations = 1000
const stringDecoder = new (require('string_decoder').StringDecoder)()
const textDecoder = (new TextDecoder())
const SAMPLE = new TextEncoder().encode(document.body.innerText)
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)
console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)
on nodejs.org's innerText:
variationA: 31.10400390625 ms
variationB: 1.6748046875 ms
Here is my browser bundle: https://pastebin.com/DDMKi3yH (for those who don't want to run browserify and want to test it out for themselves); the test is at the bottom. The fact that it requires ~2400 lines of code in the browser and runs much slower is exactly why I want to discourage its use.
EDIT:
here is a cdn solution:
const {StringDecoder} = await import('https://jspm.dev/string_decoder')
const iterations = 1000
const stringDecoder = new StringDecoder()
const textDecoder = new TextDecoder()
const SAMPLE = new TextEncoder().encode(document.body.innerText)
function variationA (arr) {
return stringDecoder.write(arr)
}
function variationB (arr) {
return textDecoder.decode(arr)
}
let result
console.time(variationA.name)
for (let i = 0; i < iterations; i++) {
result = variationA(SAMPLE)
}
console.timeEnd(variationA.name)
console.time(variationB.name)
for (let i = 0; i < iterations; i++) {
result = variationB(SAMPLE)
}
console.timeEnd(variationB.name)
I can confirm that TextDecoder does seem to perform better for larger sample sizes. Perhaps there is a problem with the implementation that makes it perform so poorly for small sample sizes. I don't know enough at this point to explain what's going on. These are certainly useful observations.
I did other runs (of https://github.com/nodejs/node/issues/39301#issuecomment-877623195) with the code from master instead of a release and here's what I get:
The 60% of "self time" in TextDecoder.decode actually comes from the call to the C++ function here: https://github.com/nodejs/node/blob/e2148d742549b7c88891bb50b4b4a562bf4bd46a/src/node_i18n.cc#L432
@mojavelinux:
Please don't deprecate StringDecoder...
Marking an API as legacy is not the same as deprecating it. APIs that are marked legacy are unlikely to ever change or be deprecated.
Thanks for pointing that out. Sorry for creating any confusion by mixing up the terminology.
node:string_decoder has a feature where it holds the bytes until a full Unicode character can be printed, which I don't think TextDecoder provides:
When a Buffer instance is written to the StringDecoder instance, an internal buffer is used to ensure that the decoded string does not contain any incomplete multibyte characters. These are held in the buffer until the next call to stringDecoder.write() or until stringDecoder.end() is called.
This feature is very convenient when printing outputs byte by byte, for example when implementing the streaming interface of LLM inference.
You can do that with TextDecoder as well:
> var d = new TextDecoder
undefined
> d.decode(Uint8Array.of(0xe2), { stream: true })
''
> d.decode(Uint8Array.of(0x82), { stream: true })
''
> d.decode(Uint8Array.of(0xAC), { stream: true })
'€'
📗 API Reference Docs Problem
https://nodejs.org/docs/latest-v16.x/api/string_decoder.html
Description
Like with util.inherits and querystring, can you also mark string_decoder as legacy and say something along the lines of: it would be best for cross-platform coding... Maybe even eventually deprecate it?