jbenet / transformer

transformer - multiformat data conversion
transform.datadex.io

Streams #3

Open jbenet opened 10 years ago

jbenet commented 10 years ago

@maxogden

Correctly handling streams is hard, because it's not clear when a codec should be done reading from a stream (e.g. how much xml should be parsed before the first "object" is found?). Like, if I'm parsing a type with this schema:

{
  "schema": [ {
    "dob": "iso-date",
    "name": "person-name"
  } ]
}

From this schema, transformer may be able to discern that an array of objects should be parsable object by object.
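A minimal sketch of that object-by-object idea, assuming the input happens to be serialized json, and using JSONStream purely as an illustration (it's not part of transformer):

var JSONStream = require('JSONStream')
var fs = require('fs')

// people.json is a hypothetical file shaped like the schema above:
// [ {"dob": "...", "name": "..."}, ... ]
fs.createReadStream('people.json')
  .pipe(JSONStream.parse('*')) // emits one {dob, name} object at a time
  .on('data', function (person) {
    // each person is a complete object, ready for conversion
  })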

However, even if that's done correctly, it's not clear the conversion itself won't have to be sync (i.e. buffer the whole input). I may be converting this list of {name/dob} objects to a type like:

{
  "schema": {
    "oldest-person": {
      "dob": "iso-date",
      "name": "person-name"
    },
    "youngest-person": {
      "dob": "iso-date",
      "name": "person-name"
    }
  }
}

Which would need to read the whole input anyway to compute. Hmmmm. This could be a per-codec thing: implement encodeStream and encodeSync or something.
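Sketching what that per-codec split might look like (the names and shapes here are just assumptions, not a settled api):

var through2 = require('through2')

// an item-by-item codec can offer both forms:
var isoToUnix = {
  // sync: convert one complete value
  encodeSync: function (isoDate) {
    return Math.floor(new Date(isoDate).getTime() / 1000)
  },
  // stream: wrap the sync conversion, one item at a time
  encodeStream: function () {
    return through2.obj(function (item, enc, next) {
      next(null, isoToUnix.encodeSync(item))
    })
  }
}

// a whole-input codec (like oldest-person/youngest-person above) would only
// implement encodeSync, and streaming callers would have to buffer first.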

Also, worth noting that most tiny codecs will just be sync: things like parsing Dates.

jbenet commented 10 years ago

@mafintosh @maxogden would love to get your thoughts on how to make transformer handle streams correctly. In particular, given the complication of figuring out "what is a full object ready for conversion".

As discussed today, this is easy with ldjson (on second thought, consider renaming to ndjson, since ldjson and jsonld will be confusing to people), but not so easy when you have some complicated format. The json selector idea could help here, but not all inputs will be json. Maybe a way to go is one of the interfaces in the next comment.
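For the easy ndjson case, splitting on newlines is enough to find full objects; a minimal sketch (using split2 purely as an illustration, not something transformer depends on):

var split2 = require('split2')

// every newline-delimited line becomes one parsed, conversion-ready object
process.stdin
  .pipe(split2(JSON.parse))
  .on('data', function (obj) {
    // obj is a complete object, ready for a codec to convert
  })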

jbenet commented 10 years ago

Unclear how best to do the stream interface. Could do:

stream wrap: pipes internally

var jsonStream = ...
var json2csv = transformer('json-stream', 'csv-stream');
var csvStream = json2csv(jsonStream)

stream-specific interface

var jsonStream = ...
var json2csvStream = transformer.stream('json-stream', 'csv-stream');
jsonStream.pipe(json2csvStream)
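
For reference, a rough sketch of what the first (stream wrap) form could do internally, assuming a per-item convert function (all names here are illustrative):

var through2 = require('through2')

function streamWrap(convert) {
  // returns a function that takes the input stream and hands back the output
  return function (input) {
    var output = through2.obj(function (item, enc, next) {
      next(null, convert(item))
    })
    return input.pipe(output) // the pipe happens internally
  }
}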
jbenet commented 10 years ago

Turns out the above interfaces have different use cases.

transform(er) streams

For example, it should be possible to convert simple objects like this:

var isodate = ... // stream that emits iso-date objects
var unixtime = transformer.stream('iso-date', 'unix-time')
isodate.pipe(unixtime)
// unixtime will emit converted items

which is the same as:

var through2 = require('through2')

var isodate = ... // stream that emits iso-date objects
var iso2unix = transformer('iso-date', 'unix-time')
var unixtime = through2.obj(function(item, enc, next) {
  this.push(iso2unix(item))
  next()
});
isodate.pipe(unixtime)

Basically, a regular transform(er) stream.

transforming streams

This is different from transforming one stream into another type of stream (it sounds confusing, I know). Consider:

var through2 = require('through2')

var csvStream = ... // some stream emitting _rows_ of data.
csvStream.headers // a row containing the whole csv's headers
var jsonStream = through2.obj(csv2json)
csvStream.pipe(jsonStream)

function csv2json(row, enc, next) {
  var json = {}
  for (var i = 0; i < row.length; i++) {
    json[csvStream.headers[i]] = row[i]
  }
  next(null, json) // through2's callback takes (err, data)
}

Here, the function inside the transform/through stream needs to track the headers somehow. In this case, the headers are a property on the stream. In other cases, the headers might come as the first item. However it happens, the transform function needs to be aware of the headers somehow (if they're not embedded in every item). For these cases, where the stream has to be "special", we need to "transform the stream" rather than "wrap the transform in a stream".

So, either "transform the stream":

var csvStream = ...
var csvs2jsons = transformer('csv-stream', 'json-stream')
var jsonStream = csvs2jsons(csvStream) // transform the stream!
// calls csvStream.pipe(jsonStream) internally.
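
Internally, that returned function might look something like this sketch (reusing the csv2json logic from above; not a settled implementation):

function csvs2jsons(csvStream) {
  var jsonStream = through2.obj(function (row, enc, next) {
    var json = {}
    for (var i = 0; i < row.length; i++) {
      json[csvStream.headers[i]] = row[i]
    }
    next(null, json)
  })
  return csvStream.pipe(jsonStream) // the pipe happens in here
}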

Or, the transformer takes in only the options, and pipe is called outside:

var csvStream = ...
var csvs2jsons = transformer('csv-stream', 'json-stream')
var jsonStream = csvs2jsons(csvStream.headers)
csvStream.pipe(jsonStream) // call pipe explicitly

something in the middle?

Alternatively, there are hackier options that still follow the transformer.stream api. They aren't technically "transforming the stream"; rather, they set up the needed options. So:

Set a property right on the stream:

var csvStream = ...
var jsonStream = transformer.stream('csv', 'json') // expects .headers property
jsonStream.headers = csvStream.headers
csvStream.pipe(jsonStream)

Or, like the second "transforming the streams" example, construct the stream by calling the transformer with the needed options. If either of these approaches is picked, transformer.stream should always return a function that returns the stream, even if no options are needed (instead of sometimes returning a function and sometimes a stream).

var csvStream = ...
var csvs2jsons = transformer.stream('csv', 'json')
var jsonStream = csvs2jsons(csvStream.headers)
csvStream.pipe(jsonStream)
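
Under that always-a-function rule, even a no-options conversion keeps the same shape:

var makeUnixtime = transformer.stream('iso-date', 'unix-time')
var unixtime = makeUnixtime() // still called, just with no options
isodate.pipe(unixtime)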
jbenet commented 10 years ago

@maxogden feedback? o/ lmk which feels more natural.