nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
106.62k stars 29.07k forks source link

provide a way to introduce new encodings for `new Buffer('someString', 'someEncoding')` #2835

Closed Mithgol closed 8 years ago

Mithgol commented 9 years ago

In the constructor new Buffer('someString', 'someEncoding') Node.js v4.0.0 itself supports a limited number of encodings: 'ascii', 'utf8', 'utf16le' (aka 'ucs2'), 'base64', 'hex', 'binary'. That's a half of a dozen. That's not plenty.

And that's why a widely used package iconv-lite (450+ direct dependents, ≈240+ thousands of daily downloads) provides a method (.extendNodeEncodings()) that adds a support of many other known encodings to the Buffer API.

However, iconv-lite does not seem to work in Node v4.0.0 well enough. Any use of an iconv-lite-provided encoding in an attempt of new Buffer('someLatinString', 'encodingName') results in some random output such as the following:

(screenshot)

It seems to me that either 0fa6c4a6fc7ed4a2dfe821f1d3a46d5f6ff43e69 was not enough to fix #1547 or a separate deeper issue exists.

Currently iconv-lite's extend-node.js changes the behaviour of the following methods:

in SlowBuffer in Buffer
SlowBuffer.prototype.toString Buffer.prototype.toString
SlowBuffer.prototype.write Buffer.prototype.write
SlowBuffer.byteLength Buffer.byteLength
  Buffer.isEncoding

Why those changes were enough in Node v0.10 and v0.12 but aren't in Node v4.0.0?

I may be wrong, but… it seems to me that in Node v0.12 the Buffer's constructor have used this.write(subject, encoding) internally but in the current Node v4.0.0 neither the Buffer's constructor nor its fromString helper do that. They seem to use binding.createFromString(string, encoding) (where necessary) or allocPool.write(string, poolOffset, encoding) and both of these come from process.binding('buffer') and aren't replaced by iconv-lite. And it won't be easy to replace them from userland, I suppose.

Is my assumption correct?

What should be done in iconv-lite (or in Node.js, or in both) for a multitude of encodings to work in the new Buffer('someString', 'encodingName') constructor correctly?

Mithgol commented 9 years ago

And a status badge goes to nodejs/node#2798.

Mithgol commented 9 years ago

This issue may be seen as a spiritual successor to nodejs/node-v0.x-archive#1772.

bnoordhuis commented 9 years ago

I'm inclined to say this is solely on iconv-lite's head for trying to monkey-patch core, and I say this as the author of a module that does something similar. Opinions, anyone?

ChALkeR commented 9 years ago

Monkey-patching is not supported, and will never be for obvious reasons. That includes both overriding builtin methods and adding new methods to builtins (both Node.js and v8). Such code could and will break in minor or patch versions, or even within the same Node.js version.

It's entirely iconv-lite fault for breaking here.

On the other hand, a supported method of defining new encodings seems like a valid idea, given that overriding an already defined (both built-in or thirdparty) encoding produces an error.

Fishrock123 commented 9 years ago

Introducing new encoding options from userland seems valid, so long as it doesn't require any V8 monkey-business.

Patching core, especially in this case, isn't exactly supported. Buffer changes were forced on us via V8, nothing we can really do.

ChALkeR commented 9 years ago

We could add API like Buffer.registerEncoding(encoding, toBinary, fromBinary), which would register encoding if it doesn't exist yet, and that will throw an error if that encoding is already defined.

ChALkeR commented 9 years ago

On the other hand, I see no actual profit in the above. I am just saying that doing so is possible.

As always, partial (~⅓) usage of .extendNodeEncodings:

a/alinator-0.2.2.tgz/lib/requests/scheduleRequest.js:9:iconv.extendNodeEncodings();
a/appbuilder-2.9.3-426.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_creditcard.js:5:iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_unionpay.js:5:iconv.extendNodeEncodings();
d/delatinise-cli-0.1.11.tgz/lib/delatinise.js:159:      iconv.extendNodeEncodings();
f/fiunis-2.1.1.tgz/fiunis.js:1:require('iconv-lite').extendNodeEncodings();
f/fidonet-squish-0.0.39.tgz/fidonet-squish.js:5:require('iconv-lite').extendNodeEncodings();
f/fidonet-jam-3.8.7.tgz/fidonet-jam.js:6:require('iconv-lite').extendNodeEncodings();
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:7:    iconv.extendNodeEncodings = function extendNodeEncodings() {
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:187:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:8:    iconv.extendNodeEncodings = function extendNodeEncodings() {
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:179:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
l/linez-3.2.1.tgz/js/linez.js:4:iconv.extendNodeEncodings();
n/node-captions-0.4.2.tgz/captions.js:12:iconv.extendNodeEncodings();
n/node-cmpp-3-1.1.7.tgz/cmppSocket.js:17:iconv.extendNodeEncodings();
n/nativescript-1.0.2.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/lib/ral.js:25:iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/test/protocol/http_protocol_post_test.js:17:iconv.extendNodeEncodings();
r/rastreiojs-0.2.4.tgz/lib/index.js:21:iconv.extendNodeEncodings();
s/simteconf-0.6.10.tgz/simteconf.js:10:require('iconv-lite').extendNodeEncodings();
s/slap-0.1.28.tgz/lib/cli.js:80:    iconv.extendNodeEncodings();
s/sro-0.1.4.tgz/lib/sro.js:13:iconv.extendNodeEncodings();
u/umpay-transfer-1.0.4.tgz/index.js:10:iconv.extendNodeEncodings();
u/umpay-1.1.1.tgz/index.js:10:iconv.extendNodeEncodings();

That's actually pretty low (18 modules). I expect the total count to be around 50-60 modules.

@Mithgol Could you explain how iconv.extendNodeEncodings() is better than manually converting those encodings, for the reference?

Mithgol commented 9 years ago

@ChALkeR

It's better because it's easier to introduce to an existing project. It feels almost infinitely easier.

You just write require('iconv-lite').extendNodeEncodings() once and suddenly it just works everywhere like a charm.

You don't have to do anything else.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

seishun commented 8 years ago

I don't think it should be supported to change the behavior of core functions from userland. We don't do that anywhere else as far as I'm aware.

danielgindi commented 8 years ago

I think that allowing a module to add more encodings to Buffer - makes so much sense, that I can't understand the resistance. Take for example a giant project, which has many points where encodings are used. It reads CSVs, it creates log files and Excel files, it is maybe reading outputs of other system modules or remote modules over the web. Encodings may vary! And you don't want to write so much code to handle encodings in so many places, it almost seems like a hack itself.

The proper solution would be to use the native Node.js features for handling Buffers and Encoding. But the thing that is missing is the support for new encodings.

Mithgol commented 8 years ago

@seishun

That's true, Node.js does not support extending the behaviour of its core functions anywhere else, but that's probably because Node.js does not have an obviously limited support of an obviously vast area (such as encodings) anywhere else.

Also, most of other such core modules are wrappers around third-party libraries, and that's their excuse. For example,

Mithgol commented 8 years ago

This comment is a nudge after a couple of months.

trevnorris commented 8 years ago

@Mithgol Basically taking @ChALkeR's API, would Buffer.registerEncoding(encoding, toBuffer, toString) work? Where toBuffer returns a Buffer and toString return a String. Both of which are propagated to the user.

Mithgol commented 8 years ago

Yes.

seishun commented 8 years ago

I don't think Buffer should be used for encoding/decoding beyond the few common encodings that it provides for convenience. It's simply outside of its scope of responsibility.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

TBH if one ends up with Buffer calls all over the codebase and needs to change the encoding in all of them, it seems like code smell. It's not node's job to help with bad architectural decisions.

And even in your example, you'd still have to carefully "upgrade" all the Buffer(string) calls to Buffer(string, 'whatever-encoding') anyway.

It's such a rare edge case that adding an API that will likely cause many issues down the road for questionable benefit doesn't seem worth it.

jasnell commented 8 years ago

Recommending that this be closed. Modules that want to provide support for other encodings can do so easily without monkey patching node or having node provide any kind of extension mechanism. I don't think we should be encouraging this anti-pattern further.

bnoordhuis commented 8 years ago

Let's close then.

Mithgol commented 8 years ago

Quick summary of changes that are necessary to do without require('iconv-lite').extendNodeEncodings():

¯\_(ツ)_/¯

ChALkeR commented 7 years ago

@Mithgol #13644 seems to solve this in a nicer (and more standardized) way. Does it cover the usecases which you have in mind for this?

Mithgol commented 7 years ago

@ChALkeR Unfortunately, not all of them, because the list of WHATWG-supported encodings seems to be quite limited.

For example, UTF-7 is not supported and hence I cannot use it for Fidonet Unicode substrings.

I'd better stay on iconv-lite.