node-task / spec

A specification for javascript tasks.
MIT License

tasks and pipelines #12

Closed cowboy closed 11 years ago

cowboy commented 11 years ago

So I have this idea.

What if node-tasks didn't care about reading or writing files? What if they used Buffers for input and output? Buffers are essentially typed arrays that can store arbitrarily-encoded string or binary data, and you can encode and decode String <-> Buffer as long as you know the encoding.

So files still need to get read and written, of course. But instead of baking that logic into every task, there could be a "readfiles" node-task that reads files and outputs Buffers, along with a "writefiles" node-task that accepts Buffer input and writes files. Think of them as filesystem "adapter" tasks.

Then node-tasks can be even simpler.

Now, all a simple "concat" task cares about is its input Buffers and some options, like what it should use as a separator (if anything). It could basically return Buffer1 + Buffer2 + Buffer3. On the other hand, an "uglify" task might care about not only input Buffer contents, but also filepaths associated with each Buffer, for sourcemap generation.
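
As a rough sketch, the core of that "concat" case could be little more than Node's built-in Buffer.concat (the separator handling here is just one possible approach, not a spec):

function concatBuffers(buffers, separator) {
  // No separator: just join the raw bytes.
  if (!separator) { return Buffer.concat(buffers); }
  // Otherwise interleave an encoded separator Buffer between the inputs.
  var sep = new Buffer(separator);
  var parts = [];
  buffers.forEach(function(buffer, i) {
    if (i > 0) { parts.push(sep); }
    parts.push(buffer);
  });
  return Buffer.concat(parts);
}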

So, what if we subclass Buffer and create FileBuffer. It's just like a Buffer but also has a filepath and an encoding property. Maybe it has readFile and writeFile methods. Maybe there are promises. I dunno.

For example, if path/to/foo.js is encoded as utf8 and read into a FileBuffer:

var FileBuffer = require('filebuffer').FileBuffer;
var fb = new FileBuffer();
fb.readFile('path/to/foo.js', 'utf8').then(stuff);
fb.filepath // 'path/to/foo.js'
fb.encoding // 'utf8'
fb.toString() // 'contents_of_foo' (decoded using this.encoding by default)

Ok, so anyways, all I'm getting at is that we have objects that have a filepath, encoding and Buffer of file contents. They can be plain JS objects, whatever. I'll use the term FileBuffer for now.
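
To make that concrete, here is a minimal plain-object sketch of the idea, using when.js for the promise; the exact shape (the readFile signature, lazily decoded contents) is just an assumption:

var fs = require('fs');
var when = require('when');

function FileBuffer(filepath, encoding) {
  this.filepath = filepath;
  this.encoding = encoding || 'utf8';
  this.buffer = null;
}

// Read raw bytes from disk; decoding is deferred until toString() is called.
FileBuffer.prototype.readFile = function(filepath, encoding) {
  var self = this;
  if (filepath) { self.filepath = filepath; }
  if (encoding) { self.encoding = encoding; }
  return when.promise(function(resolve, reject) {
    fs.readFile(self.filepath, function(err, buffer) {
      if (err) { return reject(err); }
      self.buffer = buffer;
      resolve(self);
    });
  });
};

// Decode the stored Buffer, defaulting to this FileBuffer's encoding.
FileBuffer.prototype.toString = function(encoding) {
  return this.buffer.toString(encoding || this.encoding);
};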

So, when we create our "concat" task, we could add in logic that says, "iterate over all source files. if source value is a FileBuffer, use that value. Otherwise if source value is a filepath, read that file using the appropriate encoding and store the results in a Buffer. Then, concatenate Buffers, done."

Or we could simply do, "iterate over all source FileBuffers. Concatenate Buffers, done."

And the same with output. This "concat" task will need to either write some files or pass along output Buffers. So, simplify this and don't worry about file generation. Just return Buffers. Or FileBuffers. Whatever.
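
Under that simplified model, a "concat" task could look something like the sketch below. It assumes the FileBuffer shape and concatBuffers helper sketched above, never touches the filesystem, and simply reuses the dest value as the output FileBuffer's filepath (one possible choice):

// files: [{src: [FileBuffer, ...], dest: 'out.css'}, ...]
function concat(files, options) {
  return files.map(function(file) {
    var buffers = file.src.map(function(fileBuffer) { return fileBuffer.buffer; });
    // Package the combined contents as a new FileBuffer, named after dest.
    var output = new FileBuffer(file.dest);
    output.buffer = concatBuffers(buffers, options && options.separator);
    return {src: [output], dest: file.dest};
  });
}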

So why do I care about Buffers? Why not just write files? The answer is simple:

Pipelining.

Let's say I've got 3 tasks: csslint, concat, cssmin. Now, if every task only read and wrote files, you'd have a lot of temporary files. Because csslint would have to write files for concat to read, and concat would have to write files for cssmin to read. Etc.

So instead of writing temp files (which could totally be a valid option) we pass FileBuffers between tasks. Now, I'm not suggesting that tasks should be aware of other tasks. In fact, the opposite. Tasks should be super dumb. They just do what they do with the FileBuffers and options specified, end of story.

But the task runner could be told to take this set of file src-dest mappings, and from there, individually lint a.css and b.css, then concat a.css and b.css, then minify that concatenated source, then write ab.css. And do the same thing for c.css + d.css -> cd.css.

So the task runner passes something like this into the readfiles task:

files: [
  {src: ['a.css', 'b.css'], dest: 'ab.css'},
  {src: ['c.css', 'd.css'], dest: 'cd.css'},
]

Which resolves to a similar object, but with the filepaths updated to be FileBuffer objects. The dest properties are just passed through:

files: [
  {src: [FileBuffer{filepath: 'a.css', ...}, FileBuffer{filepath: 'b.css', ...}], dest: 'ab.css'},
  {src: [FileBuffer{filepath: 'c.css', ...}, FileBuffer{filepath: 'd.css', ...}], dest: 'cd.css'},
]
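
A "readfiles" adapter along those lines might be as small as this sketch (again assuming the FileBuffer shape above and when.js for promises; the options handling is made up):

var when = require('when');

// Resolve each src filepath to a loaded FileBuffer; dest passes through untouched.
function readfiles(files, options) {
  var encoding = (options && options.encoding) || 'utf8';
  return when.all(files.map(function(file) {
    var loading = file.src.map(function(filepath) {
      return new FileBuffer(filepath, encoding).readFile();
    });
    return when.all(loading).then(function(fileBuffers) {
      return {src: fileBuffers, dest: file.dest};
    });
  }));
}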

The task runner could then pass that object into csslint. Let's say it also doesn't care about the dest values. All it does is pass the exact same input object along if successful. That way the next task in the pipeline can just continue along.

The next task, "concat", sees that same input object and actually does its work of combining the input FileBuffers. What it passes along looks something like this; the output of this task is the input to the next task, at least files-wise.

I'm not sure if there's a way to specify "interim" filenames for the FileBuffers. Like, what would happen if you tried to use "uglify" to create sourcemaps without interim files? Does that even make sense? For this example, I just had "concat" use the dest filepaths.

files: [
  {src: [FileBuffer {filepath: 'ab.css', ...}], dest: 'ab.css'},
  {src: [FileBuffer {filepath: 'cd.css', ...}], dest: 'cd.css'},
]

So, anyways, this "concat" output is passed into "cssmin", which returns a files array that looks the same but has all new FileBuffer values for the src, because they now contain minified css.

Finally, the "cssmin" result object is passed into the "writefiles" task, which takes a files list of single src Buffers mapped to single dest values and writes the files.
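
A matching "writefiles" adapter could be roughly this (same assumptions as the sketches above):

var fs = require('fs');
var when = require('when');

// Write each single-src FileBuffer's raw bytes to its dest path.
function writefiles(files, options) {
  return when.all(files.map(function(file) {
    return when.promise(function(resolve, reject) {
      fs.writeFile(file.dest, file.src[0].buffer, function(err) {
        if (err) { return reject(err); }
        resolve(file.dest);
      });
    });
  }));
}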

While tasks aren't really useful on their own, provided the readfiles and writefiles tasks exist, all one would need to do to use "concat" on its own would be something like:

readfiles.then(concat).then(writefiles);

Vs. the whole pipeline, which would look like this, but would be abstracted away by a task runner:

readfiles.then(csslint).then(concat).then(cssmin).then(writefiles)

And if someone didn't want to read or write files, but do something else with http or whatever, they wouldn't need extra task options. They'd just create "adapter" read/write tasks, and use those instead of the filesystem ones.

Now obviously, the implementation would be more complex. We'd need to pass options around, etc. But the basic idea of making Buffer-based tasks that don't care about file reading or writing would make them super modular and very promise-friendly.
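
To illustrate, a hypothetical task runner could wire a pipeline up along these lines, threading each task's output into the next and handing every task its own options (all names here are made up; csslint and cssmin are placeholders for tasks with the same shape as the sketches above):

var when = require('when');

// tasks: [{run: taskFn, options: {...}}, ...]; files: the initial src-dest mapping.
function runPipeline(tasks, files) {
  return tasks.reduce(function(promise, task) {
    return promise.then(function(result) {
      return task.run(result, task.options);
    });
  }, when.resolve(files));
}

runPipeline([
  {run: readfiles,  options: {encoding: 'utf8'}},
  {run: csslint,    options: {}},
  {run: concat,     options: {separator: '\n'}},
  {run: cssmin,     options: {}},
  {run: writefiles, options: {}}
], [
  {src: ['a.css', 'b.css'], dest: 'ab.css'},
  {src: ['c.css', 'd.css'], dest: 'cd.css'}
]);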

That's my $0.02 for the day.

tkellen commented 11 years ago

So, basically something like this: https://github.com/livingsocial/rake-pipeline/blob/master/lib/rake-pipeline/file_wrapper.rb

tkellen commented 11 years ago

Also, I'm totally down to explore this idea.

cowboy commented 11 years ago

Apparently so. Glad to see that I'm not crazy.

sindresorhus commented 11 years ago

I've been asking for this. :+1:

paulmillr commented 11 years ago

+1 this is what brunch does. Temporary files are a lot of IO work and crappy in terms of perf.

paulmillr commented 11 years ago

I’ve been working on a pipeline for two days already and will release something hackable today. My task-compatible module, called pimpline, will be able to do incremental compilation (a feature long requested in grunt that has existed in brunch) and will build stuff without temporary files at all.

paulmillr commented 11 years ago

Done. A proof-of-concept implementation of an async-IO, promise-based pipeline with support for incremental compilation is available at https://github.com/paulmillr/pimpline

Like this:

# “Input” is a simple object. Minimal example: {path: 'a.js'}.
list()          # List inputs. Returns {path: '...', stats: file-stats}.
  .then(read)   # Read inputs. Adds data prop to input. => {path, stats, data}
  .then(copy)   # Copy inputs. Passes data to next input unchanged.
  .then(jshint) # JSHint inputs. Stops pipeline if needed. => {path, stats, data}
  .then(uglify) # Minify inputs. Changes `data` in-place. 
  .then(concat) # Concatenate inputs with config, sort. Generates new inputs.
  .then(write)  # Write the inputs.

Any feedback will be appreciated.

tkellen commented 11 years ago

This looks like a really promising start. It was my initial hope that file-based tasks could be entirely self contained, but I totally see the value in factoring the code this way. I'm going to be buried this week, but I hope to pick up where we left off on this discussion next week.

Munter commented 11 years ago

This looks remarkably similar to what we are doing in assetgraph.

We built on the idea of populating the dependency graph of a web page in one go (reading all files), then applying a number of transformations to the graph before serializing to disk again.

Our most elaborate example of this can be seen in our build-production transformation that applies pretty much every optimization we could think of: https://github.com/One-com/assetgraph-builder/blob/master/lib/transforms/buildProduction.js

There might be grounds for some collaboration here if I am not misunderstanding where you are trying to take this project.

tkellen commented 11 years ago

We're all duplicating a lot of effort, that is certain. I'm really hopeful we can invent a wheel everyone would be happy using, that way we can all focus on the problems we care about (rather than implementation details) and know that our work will benefit everyone in the community. I am still sprinting my way to finishing a big project (aiming for the end of next week). I will be drawing up a spec for an "inputbuffer" api that I hope we can all agree on soon.

tkellen commented 11 years ago

There is a draft for an input buffer API on the README now. I'd love some feedback.

tkellen commented 11 years ago

So, right now, after Grunt does its glob expansion magic, we wind up with something like this:

var manyToOne = [
  { src: ['test/fixtures/foo.coffee',
          'test/fixtures/bar.coffee',
          'test/fixtures/baz.coffee'],
    dest: 'tmp/combined.js'
  }
];

var oneToOne = [
  { src: ['test/fixtures/foo.coffee'], dest: 'tmp/foo.js' },
  { src: ['test/fixtures/bar.coffee'], dest: 'tmp/bar.js' },
  { src: ['test/fixtures/baz.coffee'], dest: 'tmp/baz.js' }
];

Assuming we had a dedicated task for reading input, as @cowboy has suggested, the basic unit of work it would perform is this:

var _ = require('lodash');
var when = require('when');
var FileBuffer = require('filebuffer');

var bufferInput = function (input) {
  // deep-clone the input object and iterate over it, buffering each src/dest pair
  var fileSet = _.clone(input, true).input.map(function(set) {
    // bufferize the source files for this set
    var sources = set.src.map(function(file) {
      return FileBuffer.create('utf8').load(file);
    });
    // when all sources are buffered, replace the src array with them
    return when.all(sources).then(function(buffers) {
      set.src = buffers;
      return set;
    });
  });
  // return a promise representing the resolution of each src/dest pair
  return when.all(fileSet);
};

Which resolves to:

manyToOne = [
  { src: [<FileBuffer>,<FileBuffer>,<FileBuffer>],
    dest: 'tmp/combined.js'
  }
];

oneToOne = [
  { src: [<FileBuffer>], dest: 'tmp/foo.js' },
  { src: [<FileBuffer>], dest: 'tmp/bar.js' },
  { src: [<FileBuffer>], dest: 'tmp/baz.js' }
];

The interesting part about this is that we could have one universal read task and one universal write task, but allow them to specify an optional read/write interface. If you were using the read task and didn't want to get stuff from the local disk, as in my FileBuffer implementation (super naive at this point, just for proof-of-concept testing), you could specify an interface for s3, http, etc.
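
For example (the names and signatures here are purely illustrative, not part of the draft spec), the universal read task could accept a reader function as an option and only fall back to the local disk by default:

var fs = require('fs');
var when = require('when');

// Default adapter: read raw bytes from the local disk.
function diskReader(filepath) {
  return when.promise(function(resolve, reject) {
    fs.readFile(filepath, function(err, buffer) {
      if (err) { return reject(err); }
      resolve(buffer);
    });
  });
}

// The universal read task delegates all IO to whatever adapter it was given,
// so an s3 or http reader with the same signature could be swapped in.
function bufferInputWith(input, options) {
  var read = (options && options.reader) || diskReader;
  return when.all(input.map(function(set) {
    var loading = set.src.map(function(filepath) { return read(filepath); });
    return when.all(loading).then(function(buffers) {
      return {src: buffers, dest: set.dest};
    });
  }));
}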

tkellen commented 11 years ago

Pretty much all of this is implemented in various node-task repos now. I've also moved the spec to the wiki. I'll be posting a big issue with a request for feedback soon.