that's a good question actually... try through2({ encoding: 'utf8' }, function (chunk, enc, cb) { ... instead to explicitly set it on your transform.
So I gave that a try, still got a buffer:
var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ encoding: 'utf8' }, function (chunk, enc, cb) {
    console.log('chunk', chunk, 'enc', enc);
    this.push(chunk);
    cb();
  }));
$ node stream_test.js
chunk <Buffer 7b 0a 20 20 22 6e 61 6d 65 22 3a 20 22 61 70 70 2d 74 72 61 69 6e 69 6e 67 2d 30 31 22 2c 0a 20 20 22 76 65 72 73 69 6f 6e 22 3a 20 22 30 2e 30 2e 30 22 ...> enc buffer
@raynos help! I don't have time to try and understand this at the moment, any chance you understand this off the top of your head?
You might try setting {decodeStrings: false} on the through2 options (as they should pass through to the transform stream constructor). http://nodejs.org/api/stream.html#stream_new_stream_transform_options
edit: curiosity got me -- yes, that works.
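For reference, here is the earlier snippet with that option added (just a sketch of the suggested fix; the file name and logging are only for illustration):
var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ decodeStrings: false }, function (chunk, enc, cb) {
    // decodeStrings: false stops the writable side from re-encoding the
    // incoming utf8 string into a Buffer, so chunk arrives as a string
    console.log(typeof chunk, enc);
    this.push(chunk);
    cb();
  }));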
@stu-salsbury Thanks! I suppose this works this way so that a transform can be configured by its creator to know what type of data to transform. Still a little weird, though, that encoding: 'utf8' doesn't force decodeStrings to false. Although perhaps encoding is passed to the readable component of the transform, and not the writable.
np -- I've been spending more time than I ever intended reading and rereading that page lately!
Note to myself: should try to figure out how to fix this.
Ok so I think I've got to the bottom of it:
https://github.com/isaacs/readable-stream/blob/master/lib/_stream_writable.js#L233-L240
The thing is that there is an option called decodeStrings which is set to true by default, even when encoding is set to utf8. By passing decodeStrings: false as an option to through2, the encoding will be respected.
Seems like a bug.
We shouldn't encode a buffer as utf8 just to decode it back to a buffer.
Why should through2 not respect the decoding defaults of node? I think it's working as it should be designed...
That is to say that I don't think through2 should make assumptions about encoding or decoding, and the node Transform stream defaults should be through2's defaults. Both decodeStrings and encoding are valid properties of a Transform stream that the user of through2 can set according to their intent, since through2 rightly passes through the options specified in its parameter to the Transform stream constructor.
If the thinking is that node's Transform stream defaults are the bug, then I don't think altering the defaults in through2 is an appropriate way of expressing that.
@stu-salsbury agreed. this sounds like a node core bug.
Quoting from the docs -- "encoding String If specified, then buffers will be decoded to strings using the specified encoding." So this says that if a buffer comes in, you treat it as if it is encoded a certain way... if an incoming stream is already a string, as is the case in the example above, then this option will have no effect... so the point here is that you aren't actually ever decoding a buffer in through2 if you get an incoming string, and that decodeStrings = false is telling the transform stream not to decode back to a buffer. I agree that this is a strange choice, though perhaps it is meant to encourage buffer usage and hence drive performance? I dunno. Somebody went to the trouble to document that decodeStrings defaults to true, so it seems intentional.
EDIT: I wonder if this is a backward compatibility issue for node, more than a choice that anyone would support in a greenfield design.
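To illustrate that this is core stream behavior rather than something through2 adds, here is a minimal sketch using stream.Transform directly (it assumes a Node version that accepts the transform function in the constructor options):
var stream = require('stream');

function probe(opts) {
  // log whether the transform callback sees a Buffer or a string
  opts.transform = function (chunk, enc, cb) {
    console.log(Buffer.isBuffer(chunk) ? 'Buffer' : typeof chunk, enc);
    cb(null, chunk);
  };
  return new stream.Transform(opts);
}

probe({ encoding: 'utf8' }).write('hi');     // logs: Buffer buffer -- decodeStrings defaults to true
probe({ decodeStrings: false }).write('hi'); // logs: string utf8 -- the written string is passed through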
If this is still an open issue, what would be needed to resolve it?
If the start of the stream is defined somewhere else (in another function, another class, etc.), it may not be clear what encoding the person using through2 should expect. Example:
// this is really annoying
someObject.getStream().pipe(through2(function (chunk, enc, cb) {
  switch (enc) {
    case 'utf8':
    case 'base64':
    // etc...
  }
}))
Therefore, there is a big advantage to being able to set the encoding as an option of through2, and also to being able to leave it unset when you don't want to. I think having the setting be optional, as well as close to the function using the data, is very useful.
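One pragmatic workaround (just a sketch; someObject.getStream() is the hypothetical source from the snippet above) is to normalize inside the transform instead of branching on enc:
someObject.getStream().pipe(through2(function (chunk, enc, cb) {
  // accept either a Buffer or a string from upstream and always work with a utf8 string
  var str = Buffer.isBuffer(chunk) ? chunk.toString('utf8') : chunk;
  // ... transform str here ...
  cb(null, str);
}))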
So is this really an issue?
Hi, my English is bad, so I will express myself in code.
var d = through(opts, transform);
// then we have three questions:
// 1. d.write(buf);              // is buf a Buffer or a String?
// 2. function transform(chunk)  // is chunk a Buffer or a String?
// 3. d.on("data", data => { }); // is data a Buffer or a String?
// the answer is:
if (buf is Buffer) {
  chunk is Buffer;
} else { // buf is String
  if (opts.decodeStrings == false)
    chunk is String;
  else
    chunk is Buffer;
}
if (opts.encoding)
  data is String;
else
  data is Buffer;
So if you want your transform function to process strings, but the input bufs are Buffers, you need to do it like this:
through({ encoding: "utf8" })              // convert Buffer to String (buf => data)
  .pipe(through({ decodeStrings: false },  // prevent converting the String back to a Buffer (buf => chunk)
    function (chunk, enc, cb) {            // chunk is a string
      cb(null, chunk);
    }
  ));
These relations are so complex that I think through2 should make this easier.
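A runnable reading of that two-stage pipeline, assuming through here means through2 and reusing the ./file.json path from the first snippet, might look like this:
var fs = require('fs');
var through2 = require('through2');

fs.createReadStream('./file.json')   // emits Buffers
  .pipe(through2({ encoding: 'utf8' }, function (chunk, enc, cb) {
    cb(null, chunk);                 // the readable side decodes the pushed Buffers to utf8 strings
  }))
  .pipe(through2({ decodeStrings: false }, function (chunk, enc, cb) {
    console.log(typeof chunk);       // 'string' -- the written string is not re-encoded
    cb(null, chunk);
  }));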
I have a problem of trying to get to the bottom of things too often, and I found that this behavior actually boils down to the following:
- read with no encoding produces a buffer instead of a string, and write ignores the encoding for buffers.
- .pipe() seems to treat all streams as object streams, i.e. it reads/writes whole strings with the stream's default encoding. For example, here goes a utf8 -> utf16le converter:
process.stdin.setEncoding('utf8');
process.stdout.setDefaultEncoding('utf16le');
process.stdin.pipe(process.stdout);
- There are writableObjectMode=true and readableObjectMode=true options (and objectMode=true, which enables both). They enable Object Mode for the write and read sides of a transform stream.
- decodeStrings=false is a performance hack which only disables automatic string decoding on the writable stream, for when it is not needed. It looks awfully like Object Mode, but at least semantically it is not.
Considering all of the above, the observed behavior is not a problem (at least not with through2 or transform streams). What the OP tried to do (or at least ended up with) is an object-input stream (I don't know about the output), so he should have used writableObjectMode=true. In his code, the previous stream in the pipeline is responsible for decoding and slicing up the data, and the transform stream gets finished objects.
The real problem is that I got all of the above from experimenting and reading library code, not from the documentation.
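A sketch of that suggestion with through2 (assuming the installed readable-stream supports writableObjectMode and that through2 forwards it, since it passes its options object to the Transform constructor):
var fs = require('fs');
var through2 = require('through2');

// the previous stream is responsible for decoding; the writable side of the
// transform is put into object mode so the strings arrive untouched
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ writableObjectMode: true }, function (chunk, enc, cb) {
    console.log(typeof chunk);   // 'string'
    cb(null, chunk);
  }));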
@evg-zhabotinsky Knowing that, maybe something like this should be done:
var t2 = new DestroyableTransform(options)
// Right after construction.
// Maybe a more precise condition would be better:
// if (!!options.decodeStrings && options.encoding !== 'buffer') {
if (options.encoding !== 'buffer') {
  t2.once('pipe', function (src) {
    // This is emitted synchronously in the src.pipe method.
    if (src._readableState.encoding !== 'buffer') {
      t2._writableState.objectMode = true;
    }
  });
}
But that might cause problems for people who expect buffers all the time because of how the current streams work.
Shouldn't enc in the below example be 'utf8'? In node 0.10.26 it's buffer. Is it expected that transforms are doing string conversions? Am I doing something wrong?