iconara / rubydoop

Write Hadoop jobs in JRuby
220 stars 33 forks source link

Add an :async option to the `job` dsl method #10

Closed colinmarc closed 10 years ago

colinmarc commented 11 years ago

Per email, this allows the user to specify whether a job should be submitted asynchronously.

My editor also automatically cleaned up some whitespace - I hope that's ok.

iconara commented 11 years ago

This doesn't work. Run the tests before you submit a pull requests, and write tests for your new functionality.

You need to write tests for the DSL additions, and there should be an integration test for it too (I don't think I've included an integration test that explicitly shows that jobs are run in sequence, but now when there is a way of not running them in sequence it's definitely needed).

If you don't write tests I the next time I get a patch I will have no way of telling if they broke things or if you did.

I don't like how the configureJobs method got the side effect of submitting jobs, that should be done in run, where they are waited for.

Remove the whitespace commit. The whitespace fixes are not wrong as such, but they hide the contents of your patch. Only change whitespace if you change something significant in the code around it.

colinmarc commented 11 years ago

Ah, sorry, I couldn't get the tests to run, so I did test manually (it does work, at least). I'll endeavor to fix those, and I'll make the other changes. Thanks!

colinmarc commented 11 years ago

I fixed the tests, and added one for the dsl param - sorry about that!

Any suggestions for how to test that the jobs are being submitted asynchronously? I could just test that it works generally, but I'm not sure how to specifically identify whether they were submitted asynchronously or are being waited on.

colinmarc commented 11 years ago

And there are the java bits. Please let me know if they need tweaking, java isn't really my strong suit.

iconara commented 11 years ago

Looks to me like an async job defined after a non-async job would not be launched until after the non-async job was done.

For example:

job 'first_job' do
  ...
end

job 'second_job', async: true do
  ...
end

The first thing that would happen would be that the master would block on waitForCompletion of the first job, then when that job had completed it would launch the second job, even though that job was asynchronous.

I think you need to clearly define what the semantics of mixing sequential and async jobs are.


You could test that non-async jobs are run in sequence by having a job that works on the output of a previous one. To test async jobs you could maybe check the timestamps of the outputs, or do something where one async job read the output of the other, but since the other one would run in parallel the output would not be there yet (sort of testing that the sequential behaviour is not there). Since you want this in Rubydoop you must have a use case for what you want it for, simplify it and use it as an integration test.

colinmarc commented 11 years ago

Yeah, that's the behavior I was going for - see the commit message. What would you prefer the semantics to be?

iconara commented 11 years ago

I don't think it makes any sense. If a job is async it should not wait for other jobs.

iconara commented 11 years ago

Before you spend too much more time on this, could you please elaborate a bit more on why you need this feature? I can't really think of a situation where you'd want this without being able to wait for all jobs, or to build dependency graphs of jobs.

colinmarc commented 11 years ago

The specific case has to do with loading a bunch of tables into HBase. Each table has to be its own MR Job, but it makes sense to submit them all at once, and have the cluster parallelize them as much as it can.

So I don't need to mix async and sync jobs - I only need async jobs. Maybe there should just be a global setting?

iconara commented 11 years ago

I'm not sure a global setting would be better. I want the semantics to be clear, and I want this to be generally usable, not supporting just a single or one-off case.

colinmarc commented 11 years ago

Hm, ok. As an alternate idea, how about a pull request that separates out the configuration stuff and provides some public java interfaces? Then you could build your own runners that used rubydoop.RubydoopJob or something.

iconara commented 11 years ago

Something that made it possible to configure jobs more like you do when you write Java MR jobs, and not using the Rubydoop DSL would be good. I think that the configure DSL is too constraining.

The reason why there is a DSL and that it needs to run within Rubydoop.configure { } is that this code will run when jobs are submitted, but it will also run on each task tracker before the mapper or reducer is created. It needs to run then, because I see no other way to make sure that all the files that need to have been required have been required (since the configuration DSL forces you to require your mappers and reducers it also tricks you into setting up everything correctly).


You're welcome to give your suggestion a stab, but as you've noticed I'm very picky about what I merge. If it's well tested and it benefits the project I will merge it, but not until then. You're also welcome to continue working on this patch until it feels like it's something that would take the project forward and not limit the possibilities of adding other features in the future (like a dependency graph for jobs).

colinmarc commented 11 years ago

Totally understood. Thanks for spending the time on this!