provide a way to introduce new encodings for `new Buffer('someString', 'someEncoding')`

Mithgol commented 9 years ago

In the constructor new Buffer('someString', 'someEncoding') Node.js v4.0.0 itself supports a limited number of encodings: 'ascii', 'utf8', 'utf16le' (aka 'ucs2'), 'base64', 'hex', 'binary'. That's a half of a dozen. That's not plenty.

And that's why a widely used package iconv-lite (450+ direct dependents, ≈240+ thousands of daily downloads) provides a method (.extendNodeEncodings()) that adds a support of many other known encodings to the Buffer API.

However, iconv-lite does not seem to work in Node v4.0.0 well enough. Any use of an iconv-lite-provided encoding in an attempt of new Buffer('someLatinString', 'encodingName') results in some random output such as the following:

(screenshot)

It seems to me that either 0fa6c4a6fc7ed4a2dfe821f1d3a46d5f6ff43e69 was not enough to fix #1547 or a separate deeper issue exists.

Currently iconv-lite's extend-node.js changes the behaviour of the following methods:

in `SlowBuffer`	in `Buffer`
`SlowBuffer.prototype.toString`	`Buffer.prototype.toString`
`SlowBuffer.prototype.write`	`Buffer.prototype.write`
`SlowBuffer.byteLength`	`Buffer.byteLength`
	`Buffer.isEncoding`

Why those changes were enough in Node v0.10 and v0.12 but aren't in Node v4.0.0?

I may be wrong, but… it seems to me that in Node v0.12 the Buffer's constructor have used this.write(subject, encoding) internally but in the current Node v4.0.0 neither the Buffer's constructor nor its fromString helper do that. They seem to use binding.createFromString(string, encoding) (where necessary) or allocPool.write(string, poolOffset, encoding) and both of these come from process.binding('buffer') and aren't replaced by iconv-lite. And it won't be easy to replace them from userland, I suppose.

Is my assumption correct?

What should be done in iconv-lite (or in Node.js, or in both) for a multitude of encodings to work in the new Buffer('someString', 'encodingName') constructor correctly?

Mithgol commented 9 years ago

And a status badge goes to nodejs/node#2798.

Mithgol commented 9 years ago

This issue may be seen as a spiritual successor to nodejs/node-v0.x-archive#1772.

bnoordhuis commented 9 years ago

I'm inclined to say this is solely on iconv-lite's head for trying to monkey-patch core, and I say this as the author of a module that does something similar. Opinions, anyone?

ChALkeR commented 9 years ago

Monkey-patching is not supported, and will never be for obvious reasons. That includes both overriding builtin methods and adding new methods to builtins (both Node.js and v8). Such code could and will break in minor or patch versions, or even within the same Node.js version.

It's entirely iconv-lite fault for breaking here.

On the other hand, a supported method of defining new encodings seems like a valid idea, given that overriding an already defined (both built-in or thirdparty) encoding produces an error.

Fishrock123 commented 9 years ago

Introducing new encoding options from userland seems valid, so long as it doesn't require any V8 monkey-business.

Patching core, especially in this case, isn't exactly supported. Buffer changes were forced on us via V8, nothing we can really do.

ChALkeR commented 9 years ago

lib/buffer.js#L191 — .isEncoding. This could be patched to be aware of user-defined encodings.
lib/buffer.js#L342 — .toString. This could be patched to support user-defined .toString for a new encoding.
lib/buffer.js#L87 — .fromString. This could be patched to support user-defined .fromString for a new encoding.
lib/buffer.js#L534 — .write. This could be patched to support user-defined .write for a new encoding.

We could add API like Buffer.registerEncoding(encoding, toBinary, fromBinary), which would register encoding if it doesn't exist yet, and that will throw an error if that encoding is already defined.

ChALkeR commented 9 years ago

On the other hand, I see no actual profit in the above. I am just saying that doing so is possible.

As always, partial (~⅓) usage of .extendNodeEncodings:

a/alinator-0.2.2.tgz/lib/requests/scheduleRequest.js:9:iconv.extendNodeEncodings();
a/appbuilder-2.9.3-426.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_creditcard.js:5:iconv.extendNodeEncodings();
c/ctbc_payment-0.2.1.tgz/lib/ctbc_unionpay.js:5:iconv.extendNodeEncodings();
d/delatinise-cli-0.1.11.tgz/lib/delatinise.js:159:      iconv.extendNodeEncodings();
f/fiunis-2.1.1.tgz/fiunis.js:1:require('iconv-lite').extendNodeEncodings();
f/fidonet-squish-0.0.39.tgz/fidonet-squish.js:5:require('iconv-lite').extendNodeEncodings();
f/fidonet-jam-3.8.7.tgz/fidonet-jam.js:6:require('iconv-lite').extendNodeEncodings();
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:7:    iconv.extendNodeEncodings = function extendNodeEncodings() {
h/hzip-1.1.0.tgz/encoding/iconv-lite/lib/extend-node.js:187:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:8:    iconv.extendNodeEncodings = function extendNodeEncodings() {
i/iconv-lite-0.4.10.tgz/lib/extend-node.js:179:            throw new Error("require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
l/linez-3.2.1.tgz/js/linez.js:4:iconv.extendNodeEncodings();
n/node-captions-0.4.2.tgz/captions.js:12:iconv.extendNodeEncodings();
n/node-cmpp-3-1.1.7.tgz/cmppSocket.js:17:iconv.extendNodeEncodings();
n/nativescript-1.0.2.tgz/lib/common/mobile/android/android-emulator-services.js:29:        iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/lib/ral.js:25:iconv.extendNodeEncodings();
n/node-ral-0.0.36.tgz/test/protocol/http_protocol_post_test.js:17:iconv.extendNodeEncodings();
r/rastreiojs-0.2.4.tgz/lib/index.js:21:iconv.extendNodeEncodings();
s/simteconf-0.6.10.tgz/simteconf.js:10:require('iconv-lite').extendNodeEncodings();
s/slap-0.1.28.tgz/lib/cli.js:80:    iconv.extendNodeEncodings();
s/sro-0.1.4.tgz/lib/sro.js:13:iconv.extendNodeEncodings();
u/umpay-transfer-1.0.4.tgz/index.js:10:iconv.extendNodeEncodings();
u/umpay-1.1.1.tgz/index.js:10:iconv.extendNodeEncodings();

That's actually pretty low (18 modules). I expect the total count to be around 50-60 modules.

@Mithgol Could you explain how iconv.extendNodeEncodings() is better than manually converting those encodings, for the reference?

Mithgol commented 9 years ago

@ChALkeR

It's better because it's easier to introduce to an existing project. It feels almost infinitely easier.

You just write require('iconv-lite').extendNodeEncodings() once and suddenly it just works everywhere like a charm.

You don't have to do anything else.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

seishun commented 8 years ago

I don't think it should be supported to change the behavior of core functions from userland. We don't do that anywhere else as far as I'm aware.

danielgindi commented 8 years ago

I think that allowing a module to add more encodings to Buffer - makes so much sense, that I can't understand the resistance. Take for example a giant project, which has many points where encodings are used. It reads CSVs, it creates log files and Excel files, it is maybe reading outputs of other system modules or remote modules over the web. Encodings may vary! And you don't want to write so much code to handle encodings in so many places, it almost seems like a hack itself.

The proper solution would be to use the native Node.js features for handling Buffers and Encoding. But the thing that is missing is the support for new encodings.

Mithgol commented 8 years ago

@seishun

That's true, Node.js does not support extending the behaviour of its core functions anywhere else, but that's probably because Node.js does not have an obviously limited support of an obviously vast area (such as encodings) anywhere else.

Also, most of other such core modules are wrappers around third-party libraries, and that's their excuse. For example,

How should I extend crypto with SHA-3? — crypto is a wrapper around OpenSSL, just wait until that hash becomes supported in the upstream.
How should I extend Intl with a support for Tengwar? — We merely wrap around ICU, you should wait until Tengwar becomes supported by the Unicode Consortium according to their roadmap (subject to change).
How should I extend zlib with a support for unpacking .ZIP files in addition to its current inflating deflated streams? — zlib is merely a wrapper around the well-known zlib library and thus it's weird to add some features that are missing in the upstream.

Mithgol commented 8 years ago

This comment is a nudge after a couple of months.

trevnorris commented 8 years ago

@Mithgol Basically taking @ChALkeR's API, would Buffer.registerEncoding(encoding, toBuffer, toString) work? Where toBuffer returns a Buffer and toString return a String. Both of which are propagated to the user.

Mithgol commented 8 years ago

Yes.

seishun commented 8 years ago

I don't think Buffer should be used for encoding/decoding beyond the few common encodings that it provides for convenience. It's simply outside of its scope of responsibility.

For example, you don't have to remember every place where Buffer(string, encoding) was written previously and carefully upgrade them to iconvLite.encode while simultaneously keeping in mind that Buffer(string) (without encoding) defaults to UTF-8 but iconvLite.encode doesn't.

TBH if one ends up with Buffer calls all over the codebase and needs to change the encoding in all of them, it seems like code smell. It's not node's job to help with bad architectural decisions.

And even in your example, you'd still have to carefully "upgrade" all the Buffer(string) calls to Buffer(string, 'whatever-encoding') anyway.

It's such a rare edge case that adding an API that will likely cause many issues down the road for questionable benefit doesn't seem worth it.

jasnell commented 8 years ago

Recommending that this be closed. Modules that want to provide support for other encodings can do so easily without monkey patching node or having node provide any kind of extension mechanism. I don't think we should be encouraging this anti-pattern further.

bnoordhuis commented 8 years ago

Let's close then.

Mithgol commented 8 years ago

Quick summary of changes that are necessary to do without require('iconv-lite').extendNodeEncodings():

Obvious replacements:
- Buffer.isEncoding → iconv.encodingExists
- buf.toString(encoding) → iconv.decode(buf, encoding)
- buf(str, encoding) → iconv.encode(str, encoding)
Cannot use method chaining; for example, Buffer(str).toString('utf7') becomes iconv.decode(iconv.encode(str, 'utf8'), 'utf7')
Cannot use encoding options of fs methods; for example,
- fs.readFileSync(filename, {encoding: encoding}) → iconv.decode(fs.readFileSync(filename), encoding)
- fs.writeFileSync(filename, content, {encoding: encoding}) → fs.writeFileSync(filename, iconv.encode(content, encoding))
Cannot use simple buf.toString(encoding, start, end) conversion, should nest iconv.decode(buf.slice(start, end), encoding)
Sometimes adding iconv-lite to dependencies becomes necessary to check iconv.encodingExists explicitly in places where Buffer.isEncoding silently worked because of require('iconv-lite').extendNodeEncodings() in another dependency.

¯\_(ツ)_/¯

ChALkeR commented 7 years ago

@Mithgol #13644 seems to solve this in a nicer (and more standardized) way. Does it cover the usecases which you have in mind for this?

Mithgol commented 7 years ago

@ChALkeR Unfortunately, not all of them, because the list of WHATWG-supported encodings seems to be quite limited.

For example, UTF-7 is not supported and hence I cannot use it for Fidonet Unicode substrings.

I'd better stay on iconv-lite.

nodejs / node

provide a way to introduce new encodings for `new Buffer('someString', 'someEncoding')` #2835