iconara / rubydoop

Write Hadoop jobs in JRuby

Support running jobs in parallel from the Rubydoop job runner #29

Closed grddev closed 9 years ago

grddev commented 9 years ago

This introduces parallel blocks into the configuration DSL, allowing you to specify jobs that should be executed in parallel. In addition, it adds support for nested sequence blocks inside the parallel blocks, so that sequences of jobs can be executed in separate swim-lanes.
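To illustrate the shape of the new DSL, a configuration could look something like the sketch below. The job names, inputs, and outputs are made up for the example; only the `parallel`/`sequence` nesting is what this PR adds, and the surrounding `Rubydoop.configure` form follows the existing DSL:

```ruby
# Hypothetical example of the new DSL (job names and paths are invented).
# `parallel` runs its children concurrently; a nested `sequence` runs its
# jobs one after another in its own swim-lane.
Rubydoop.configure do |input_path, output_path|
  parallel do
    sequence do
      job 'prepare_left' do
        input input_path
        output "#{output_path}/left"
      end
      job 'aggregate_left' do
        input "#{output_path}/left"
        output "#{output_path}/left_agg"
      end
    end
    job 'prepare_right' do
      input input_path
      output "#{output_path}/right"
    end
  end
end
```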

Unfortunately, I couldn't find any way to listen for completion events on Hadoop's Job class, which means the only simple way to support the nesting described above is to dispatch parallel jobs using threads. The thread overhead should be acceptable, though, so let's not worry about it until it becomes a real problem.
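The thread-based dispatch can be sketched in plain Ruby along these lines. This is not the actual Rubydoop runner, just a minimal illustration: each branch of a parallel block gets its own thread, a nested sequence simply runs its jobs in order on that thread, and the stand-in `Job#run` plays the role of Hadoop's blocking `waitForCompletion`:

```ruby
# Stand-in for a configured Hadoop job; #run blocks until "completion",
# like Job#waitForCompletion does in the real runner.
Job = Struct.new(:name) do
  def run(log)
    log << name # record execution order for the example
    true        # pretend the job succeeded
  end
end

# A sequence is a swim-lane: run jobs in order, stop at the first failure.
def run_sequence(jobs, log)
  jobs.all? { |job| job.run(log) }
end

# A parallel block: one thread per lane, join them all, succeed only if
# every lane succeeded.
def run_parallel(lanes, log)
  threads = lanes.map { |lane| Thread.new { run_sequence(lane, log) } }
  threads.map(&:value).all? # Thread#value joins and returns the result
end

log = Queue.new # thread-safe log of executed job names
ok = run_parallel([[Job.new('a1'), Job.new('a2')], [Job.new('b1')]], log)
names = [].tap { |a| a << log.pop until log.empty? }
```

Within a lane the order is preserved (`a1` always runs before `a2`), while the lanes themselves may interleave arbitrarily.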

While the integration tests contain parallel blocks, they do not really depend on parallel execution. In practice, they probably do not run in parallel either, since the local Hadoop cluster most likely executes only one job at a time.

To simplify the implementation of this, I shifted the job runner logic from Java to Ruby.

iconara commented 9 years ago

I understand that it would be hard to assert that two jobs ran in parallel, but it would still be nice to have an integration test that at least exercises the DSL in a semi-real environment. Something with a few nested sequence/parallel blocks.

grddev commented 9 years ago

Admittedly, the current system test is horrible, but I couldn't really come up with a good example. My motivation for the feature is essentially to run the same job over pre-partitioned input in order to reduce total running time, but that doesn't translate well to a small system test. I will have to think about this a bit.