jakejs / jake

JavaScript build tool, similar to Make or Rake. Built to work with Node.js.
http://jakejs.com
Apache License 2.0
1.97k stars 190 forks source link

Parallel prerequisites Comments please #264

Closed veldsla closed 10 years ago

veldsla commented 10 years ago

We use jake to run and chain long running shell jobs and require many parallel processes to optimally use the server. We use functions that generate many jake file tasks to use as prerequisites and it would be great if those tasks run in parallel

This is an attempt to run the dependencies of a task in parallel. It uses caolan/async to run and limit the amount of simultaneous tasks. We needed to mess a bit with the invocationChain to accommodate this. It really requires an api change regarding the global complete function although it is currently backwards compatible as long as you don't use runparallel.

Would you be interested in a feature like this? Do you have any suggestions for improvement? We created an example jakefile here https://gist.github.com/veldsla/7d08c36ef30a504c91e0#file-jakepartest

mde commented 10 years ago

This is awesome, and something that people have requested for quite a while. I think the change to complete is fine, as long as it's backward compatible. I'll take a look -- would love to get this feature merged.

mde commented 10 years ago

A couple of things: execution isn't actually parallel in JS, since we only have the one thread of execution. A better word would be 'concurrent.' (This is relevant for the complete/this.complete, since only one task will ever be executing at one time.) It might be possible to flag the currently executing task, so that the global complete still works the same way. Instead of assuming it refers to the end of the execution chain, it would refer to the currently executing task.

veldsla commented 10 years ago

I'm aware JS is single threaded. As mentioned we use a lot of jake.exec calls which are asynchronous so many of the could be triggered simultaneously. Because it is unknown when the exec will finish the global complete can never finish the correct task. We encountered this during development when we used a series of file tasks and jake bailed out because the expected file wasn't (yet) written when complete was called. Or am I missing something obvious here?

Regarding the naming concurrent is probably a better choice. The concurrent tasks have to be asynchronous and preferably wait a lot in the asynchronous operations in order to gain from this feature. Like shelling out with jake.exec.

ben-ng commented 10 years ago

I think the runparallel option should be parallelLimit, since that's what this actually is.

Does this properly handle dependencies that depend on each other?

task A
task B depends on A
task C depends on [A, B] parallelLimit 2

Jake should know that B depends on A and not run it twice.

veldsla commented 10 years ago

I think it currently is possible to trick Jake into running a task twice, maybe if A runs longer than B it would be started again when we arrive at the B prereqs. I would like to do additional testcases, but I first wanted to get some feedback and see if there would any interest to merge such a feature.

mde commented 10 years ago

Oh, so the runner could enter another concurrent task before finishing the previous one. I should have thought of that. Given that, yes, the complete API would need to pass the task instead of the data value.

@ben-ng's point about prerequisites is a valid one. Seems like this would require a change from a simple "done" property like we have now to a "completeness" or "state" or similar that references an enumeration of UNSTARTED/RUNNING/DONE.

Even then, you would not be able to start B until A was done, which would mean there would be certain tasks which could not be run concurrently, even if listed as concurrent prereqs. Basically this changes a chain into a tree-structure where the nodes are visited bottom-up? I do think this would work even if there is a node which lives in multiple places in the tree. It might make circular prereqs easier to trip over, but it's a problem we have with the chain as well.

ben-ng commented 10 years ago

@veldsla: i think we can solve the dependency dependency problem by checking each dependency in the list and removing any subdependencies.

In my example above, C's dependency list would be reduced to B because A is a dependency of B.

This would be simpler than messing with states.

This would be an awesome feature, btw (:

veldsla commented 10 years ago

@mde I also thought about the started/finished state variable. I'm not sure @ben-ng's graph approach would work for us. We use Jake in a different way. Basically we have about three to four module tasks which internally generate several file_tasks, use those tasks as prereqs for a new task and invoke it. The module then waits for complete and moves on to the next module. So initially the task tree would be easy A->B->C, but A would generate A'->[a1,a2,a3,a4,....] and run it before completing and moving on to B.

This probably makes little sense :grinning:

mde commented 10 years ago

Ah, so you'd need to reduce all the concurrent prereqs to the end of whatever chain of sub-prereqs each might have -- effectively creating that tree, where each sibling level is a group of potentially concurrently running tasks.

Given: task('a', {async: true}); task('b' ['a'], {async: true}); task('d' ['b', 'e', 'f', 'g'], {async: true, prereqsConcurrent: true}); And: task('c', ['a', 'b', 'd'], {async: true, prereqsConcurrent: true});

Task c's final prereq list would be only 'd.' You could run d's prereqs in any order, but 'a' would have to run as a prereq of 'b.'

mde commented 10 years ago

@veldsla, are you using the Task API to add prereqs on the fly? Or are you just calling task and declaring new tasks?

veldsla commented 10 years ago

@mde That would be number two. I create several file('out', ['in'], {async:true} function() { jake.exec(...); }) and then do a task('new', ['out', 'out', ...], {async:true, parallel: 4}). I then access that task via the global jake variable, invoke it and wait for completion. This works fine, but if there's a better way I'm just getting started!

mde commented 10 years ago

Nope, that's totally legit. So you're really dealing with only one level of prereqs. That's fine, but whatever we implement needs to take deeper hierarchies into account.

veldsla commented 10 years ago

Goodmorning,

I created @ben-ng's testcase (https://gist.github.com/veldsla/634c84a223db02f1b55a#file-jakeduppre) and indeed A was run twice. I quickly introduced a started flag in the task which circumvents this problem 3d4b8034008b4ea5a00f14c5a1a9864ed521ecf7

ben-ng commented 10 years ago

Nice work!

mde commented 10 years ago

Hm, rather than just bandaiding it by adding another Boolean flag, this really should be a three-state variable.

veldsla commented 10 years ago

Sure no problem. Shall I have a go at cleaning it up? Maybe add a few tests? Would you merge it or do you have plans to implement this differently? Also I only use filetasks, tasks and rules. Do you expect this change to affect any of the other convenience tasks?

mde commented 10 years ago

Definitely needs some pretty extensive tests. I don't have plans to implement it another way, so we can certainly merge this branch once there's reasonable tests on it. All the convenience tasks (also Rules) are actually just built on top of Task (doing something very similar to what you're doing by dynamically creating them), and they won't be using the concurrent functionality.

veldsla commented 10 years ago

Slightly delayed, but I cleaned it up and added a few tests. I renamed runparallel to concurrentLimit, but I'm open to suggestions. The three state variable was implemented using strings because I was unsure where to declare a constant.

The tests cover two simple tasks running two dependencies concurrently and depending on these two tasks sequentially and concurrently. It uses the setTimeout to finish out of order and the order is asserted in the tests. Also I tried throwing an error and added @ben-ng double dependency.

Any remarks? Suggestions for other tests?

ben-ng commented 10 years ago

I think parallelLimit is more idiomatic -- async is one of the most depended upon modules for doing this, and that's what people are going to try first.

Very nice work solving the duplicate dependency problem!

mde commented 10 years ago

I was hoping to avoid the word 'parallel' since it's technically incorrect, but if the functionality is precisely the same as what people are used to in Async, then I agree it wouldn't be bad to bow to expedience.

mde commented 10 years ago

I added a couple of comments. Thanks so much for your work on this!

veldsla commented 10 years ago

I think it isn't possible to create the status object on TaskBase and still refer to in in File(Base), so I defined it on Task instead. Also added a small section to the docs.

mde commented 10 years ago

Perfect. Thanks so much for this. This will ship in Jake v8, which will be the next release, happening soon.