rvagg / through2

Tiny wrapper around Node streams2 Transform to avoid explicit subclassing noise

Respecting encoding? #14

Closed kirbysayshi closed 6 years ago

kirbysayshi commented 10 years ago

Shouldn't enc in the below example be 'utf8'? In Node 0.10.26 it's 'buffer':

var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2(function(chunk, enc, cb) {
    console.log('chunk', chunk, 'enc', enc);
    this.push(chunk);
    cb();
  }))

Is it expected that transforms are doing string conversions? Am I doing something wrong?

rvagg commented 10 years ago

that's a good question actually... try through2({ encoding: 'utf8' }, function (chunk, enc, cb) { ... instead to explicitly set it on your transform.

kirbysayshi commented 10 years ago

So I gave that a try, still got a buffer:

var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ encoding: 'utf8' }, function(chunk, enc, cb) {
    console.log('chunk', chunk, 'enc', enc);
    this.push(chunk);
    cb();
  }))
$ node stream_test.js
chunk <Buffer 7b 0a 20 20 22 6e 61 6d 65 22 3a 20 22 61 70 70 2d 74 72 61 69 6e 69 6e 67 2d 30 31 22 2c 0a 20 20 22 76 65 72 73 69 6f 6e 22 3a 20 22 30 2e 30 2e 30 22 ...> enc buffer
rvagg commented 10 years ago

@raynos help! I don't have time to try and understand this at the moment -- any chance you understand it off the top of your head?

laurelnaiad commented 10 years ago

You might try setting {decodeStrings: false} on the through2 options (as they should pass through to the transform stream constructor). http://nodejs.org/api/stream.html#stream_new_stream_transform_options

edit: curiosity got me -- yes, that works.
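
A minimal sketch of that fix applied to the snippet above (same file.json as before): with decodeStrings: false, the strings produced by the read stream reach the transform untouched.

var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ decodeStrings: false }, function(chunk, enc, cb) {
    console.log('chunk is a', typeof chunk, 'enc', enc); // chunk is a string enc utf8
    this.push(chunk);
    cb();
  }))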

kirbysayshi commented 10 years ago

@stu-salsbury Thanks! I suppose this works this way so that a transform can be configured by its creator to know what type of data to transform. Still a little weird though that encoding: 'utf8' doesn't force decodeStrings to false. Although perhaps encoding is passed to the readable component of the transform, and not the writable.

laurelnaiad commented 10 years ago

np -- I've been spending more time than I ever intended reading and rereading that page lately!

alessioalex commented 10 years ago

Note to self: should try to figure out how to fix this.

alessioalex commented 10 years ago

Ok so I think I've got to the bottom of it:

https://github.com/isaacs/readable-stream/blob/master/lib/_stream_writable.js#L233-L240

The thing is that there is an option called decodeStrings which defaults to true, even when the encoding is set to 'utf8'. If you pass decodeStrings: false as an option to through2, the encoding will be respected.
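
A simplified paraphrase of the logic behind that link (the shape of it, not the exact core source): the string-to-Buffer conversion is skipped only when decodeStrings is explicitly false, and the encoding option plays no part in that decision.

// paraphrased from readable-stream's _stream_writable.js
function decodeChunk(state, chunk, encoding) {
  if (!state.objectMode &&
      state.decodeStrings !== false &&
      typeof chunk === 'string') {
    chunk = new Buffer(chunk, encoding); // the string goes back to being a Buffer
  }
  return chunk;
}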

Raynos commented 10 years ago

Seems like a bug.

Raynos commented 10 years ago

we shouldn't decode a Buffer to a utf8 string just to encode it back into a Buffer.

laurelnaiad commented 10 years ago

Why should through2 not respect the decoding defaults of node? I think it's working as it should be designed...

That is to say that I don't think through2 should make assumptions about encoding or decoding, and the node Transform stream defaults should be through2's defaults. Both decodeStrings and encoding are valid options of the Transform stream that the user of through2 can set according to their intent, since through2 rightly passes the options specified in its parameter through to the Transform stream constructor.

If the thinking is that node's Transform stream defaults are the bug, then I don't think altering the defaults in through2 is an appropriate way of expressing that.

Raynos commented 10 years ago

@stu-salsbury agreed. this sounds like a node core bug.

laurelnaiad commented 10 years ago

quoting from the docs -- "encoding String If specified, then buffers will be decoded to strings using the specified encoding." So this says that if a buffer comes through the readable side, it gets decoded a certain way... if the incoming data is already a string, as in the example above, this option has no effect... so the point here is that you aren't ever actually decoding a buffer in through2 if you get an incoming string... and decodeStrings = false is telling the transform stream not to encode the string back into a buffer. I agree that this is a strange choice, though perhaps it's meant to encourage buffer usage and hence drive performance? I dunno. Somebody went to the trouble to document that decodeStrings defaults to true, so it seems intentional.

EDIT: I wonder if this is a backward compatibility issue for node, more than a choice that anyone would support in a greenfield design.
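
A standalone toy example of the two sides (not from the docs, just to make them concrete):

var through2 = require('through2');

// Writable side: decodeStrings: false keeps written strings as strings.
var w = through2({ decodeStrings: false }, function(chunk, enc, cb) {
  console.log(typeof chunk, enc); // string utf8
  cb(null, chunk);
});
w.write('hello');

// Readable side: encoding decodes outgoing Buffers to strings.
var r = through2({ encoding: 'utf8' }, function(chunk, enc, cb) {
  cb(null, chunk); // chunk is still a Buffer in here...
});
r.on('data', function(data) {
  console.log(typeof data); // ...but it comes out as a string
});
r.write(new Buffer('hello'));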

TakenPilot commented 10 years ago

If this is still an open issue, what would be needed to resolve it?

If the start of the stream is defined somewhere else (in another function, another class, etc.), it may not be clear what encoding the person using through2 should expect. Example:

//this is really annoying
someObject.getStream().pipe(through2(function(chunk, enc, cb) {
  switch (enc) {
    case 'utf8':
      // handle string chunks
      break;
    case 'base64':
      // handle base64-encoded chunks
      break;
    // etc...
  }
  cb(null, chunk);
}))

Therefore there is a big advantage to being able to set the encoding as an option of through2, and also a big advantage in being able to leave it unset when you don't want it. I think that having the setting optional, as well as close to the function using the data, is very useful.
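
One way around that switch (a sketch; someObject is hypothetical, as above) is to normalize inside the transform, so it no longer matters what the upstream source emits:

someObject.getStream().pipe(through2(function(chunk, enc, cb) {
  // decode Buffers ourselves; strings pass straight through
  var str = enc === 'buffer' ? chunk.toString('utf8') : chunk;
  cb(null, str);
}))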

So is this really an issue?

heroboy commented 8 years ago

Hi, my English is bad, so I will express myself in code.

var d = through2(opts, transform);
// Then we have three questions:
// 1. d.write(buf)              -- is buf a Buffer or a String?
// 2. transform(chunk, enc, cb) -- is chunk a Buffer or a String?
// 3. d.on('data', ...)         -- is data a Buffer or a String?
//
// The answers:
// if buf is a Buffer:
//     chunk is a Buffer
// else (buf is a String):
//     if opts.decodeStrings === false: chunk is a String
//     else:                            chunk is a Buffer
// if opts.encoding is set:
//     data is a String
// else:
//     data is a Buffer

So if you want your transform function to process strings, but the incoming chunks are Buffers, you need to do something like this:

var through2 = require('through2');
through2({ encoding: 'utf8' })               // convert Buffer to String (buf => data)
  .pipe(through2({ decodeStrings: false },   // prevent converting the String back to a Buffer (buf => chunk)
    function(chunk, enc, cb) {               // chunk is a string
      cb(null, chunk);
    }));

These relations are so complex that I think through2 should make this easier.
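
For what it's worth, through2 already ships a shortcut for the object-mode end of this: through2.obj (from the README) turns on objectMode for both sides, so whatever you write in reaches the transform function untouched.

var through2 = require('through2');
var t = through2.obj(function(chunk, enc, cb) {
  // chunk arrives exactly as written, string or object
  cb(null, chunk);
});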

evg-zhabotinsky commented 8 years ago

I have a habit of trying to get to the bottom of things too often, and I found that this behavior actually boils down to the following:

- decodeStrings affects only the writable side: unless it is explicitly false, strings written to the stream are converted to Buffers before they reach the transform function.
- encoding affects only the readable side: if set, Buffers pushed by the transform are decoded to strings before being emitted.
- in object mode, chunks pass through both sides untouched.

Considering all of the above, the observed behavior is not a problem (at least not with through2 or transform streams). What the OP tried to do (or at least ended up with) is an object-input stream (I don't know about the output), so he should have used writableObjectMode=true. In his code, the previous stream in the pipeline is responsible for decoding and slicing up the data, and the transform stream gets finished objects.

The real problem is that I got all of the above from experimenting and reading the library code, not from the documentation.
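
A sketch of that suggestion applied to the original example (assuming a Node/readable-stream version whose Transform supports writableObjectMode): the writable side then passes the upstream strings through untouched.

var fs = require('fs');
var through2 = require('through2');
fs.createReadStream('./file.json', { encoding: 'utf8' })
  .pipe(through2({ writableObjectMode: true }, function(chunk, enc, cb) {
    console.log('chunk is a', typeof chunk); // chunk is a string
    this.push(chunk);
    cb();
  }))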

hollowdoor commented 7 years ago

@evg-zhabotinsky Knowing that, maybe something like this should be done.

var t2 = new DestroyableTransform(options)
// Right after construction.
// Maybe a more precise condition would be better:
// if (!!options.decodeStrings && options.encoding !== 'buffer') {
if (options.encoding !== 'buffer') {

    t2.once('pipe', function(src) {
        // 'pipe' is emitted synchronously from the src.pipe method
        if (src._readableState.encoding) { // upstream is emitting strings, not Buffers
            t2._writableState.objectMode = true
        }
    })
}

But that might cause problems for people who expect buffers all the time because of how the current streams work.