iconara / rubydoop

Write Hadoop jobs in JRuby

Add ConfigurationDefinition#after #34

Closed jowl closed 6 years ago

jowl commented 9 years ago

As promised, here's a PR suggesting to add ConfigurationDefinition#after, which would enable users to register callbacks to be run when all jobs have finished. The callbacks are called with a RunResult instance containing information about the run, e.g.

Rubydoop.run do |*args|

  # job setup

  after do |result|
    if result.success?
      # job finished successfully
    else
      # job failed
    end
  end
end

Taking this further, you could implement similar callbacks for individual jobs and sequences thereof, but I thought I'd start simple.

bjorne commented 9 years ago

Nice! Two thoughts:

jowl commented 9 years ago

@iconara commented on the error handling in the callbacks in my previous PR, but I forgot about it. It's fixed now though.

The latter idea might be good, even if I don't know how much useful info you could extract from a Job instance. The problem with implementing it is that there's no way of knowing which job failed from the ConfigurationDefinition, as it only has a reference to the Context, which in turn has a reference to a job stack, which has references to all the Job instances. One would have to thread the failing job all the way back out to the ConfigurationDefinition, either via #wait_for_completion or a new accessor like #current_job, to get it to work. It would probably be interesting to know from its state whether the job was killed or failed, though.

iconara commented 9 years ago

Maybe a separate after for each job, and one for the whole application? (I'm not sure I'm for this suggestion, but I'm throwing it out there.)

jowl commented 9 years ago

Yes, I mentioned something like that in the description as well. Then it could make sense to also implement #before for individual jobs (or maybe for job sequences).

jowl commented 9 years ago

Another idea would be to simply yield the job stack and let the user find out which job (or jobs if running in parallel) has failed.

iconara commented 9 years ago

Yes, I mentioned something like that in the description as well

Ah, now I saw.

A variant that means we can defer the decision on what to yield to later is to yield a result object with a #success? method. If we want to add more things later they can be added to this result object without breaking backwards compatibility.
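A minimal sketch of such a result object (class and method names are assumptions for illustration, not the actual Rubydoop code):

```ruby
# Hypothetical result object that initially exposes only #success?.
# More accessors can be added to it later without breaking callbacks
# that were written against the original interface.
class RunResult
  def initialize(success)
    @success = success
  end

  def success?
    @success
  end
end
```

A callback then depends only on result.success?, and anything added later (the jobs, timing info) is a backwards-compatible extension.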

jowl commented 9 years ago

Now we're talking. I really like that idea.

iconara commented 9 years ago

All problems can be solved by adding another layer of indirection.

grddev commented 9 years ago

If we should have before/after hooks, I would like for them to behave similarly to RSpec, otherwise I think it will be confusing. Thus, you would have to specify #after(:all) if you wanted a final callback, and #after would implicitly be #after(:each). An optional argument representing the job (for :each) and the whole execution (for :all) seems like a reasonable thing.
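A sketch of what that RSpec-style registration could look like (this is not the API that was merged; the class name and :each/:all scopes are assumptions following the suggestion above):

```ruby
# Hypothetical RSpec-style hook registry: #after without an argument
# implicitly registers an :each hook, while #after(:all) registers a
# final callback for the whole execution.
class Hooks
  def initialize
    @callbacks = Hash.new { |hash, scope| hash[scope] = [] }
  end

  def after(scope = :each, &block)
    @callbacks[scope] << block
  end

  # Invoke all hooks for a scope, passing the job (for :each) or the
  # whole execution (for :all) as the optional argument.
  def notify(scope, arg = nil)
    @callbacks[scope].each { |callback| callback.call(arg) }
  end
end
```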

iconara commented 9 years ago

I don't agree with making it work like RSpec. In RSpec you use before and after to make sure that you get the same setup for every test. I don't see that as the primary (or even as a common) use case for this. You most likely want to do different things before and after different jobs.

grddev commented 9 years ago

I guess my problem is that I can't really see what you would use the before and after hooks in the configuration block for. I guess it is a matter of personal preference, but I prefer to have as little code as possible in the configuration block. For performing task-specific setup, I much prefer the solution in Humboldt, where we have #setup and #cleanup hooks for individual mappers and reducers.

I guess hooks could be useful for whole jobs (or collection of jobs) as well, but then I'd prefer to name them #setup and #cleanup, rather than #before and #after to avoid confusion.

iconara commented 9 years ago

They wouldn't be setup blocks though, and cleanup only in the sense that you can do things like cleaning up files on HDFS.

The primary use case for this, which might have gotten lost since this moved from another PR, and since the discussion also has been done mostly offline, is to make it possible to, for example, send a message over SNS when the application is done. This only requires after at the top level, and that's the only thing we should include from the start, IMO.

jowl commented 9 years ago

I don't really understand your point @grddev, you say

I can't really see what you would use the before and after hooks in the configuration block for

and then

I guess hooks could be useful for whole jobs

which I find a bit contradictory.

And I agree with @iconara about only including a top-level #after in this release. Although I'm not hell-bent on it being called #after. I don't dislike the name #on_completion.

grddev commented 9 years ago

@jowl, I don't really see the problem with my reasoning. I didn't know of a concrete use case, but I guessed that there was one. The second sentence was just to say that my reservation vis-à-vis the name confusion still applied.

@iconara, assuming there is no built-in way to get an SNS message when the job completes, that certainly seems like a reasonable use case. While #cleanup isn't the optimal name for the hook, it is the name used by Hadoop, and therefore the name used by Humboldt. But I seem to be in a minority here, so I'll rest my case. :)

jowl commented 9 years ago

Thanks for clarifying. I read it as something like "I don't think X is useful, but I guess it is", but now it makes sense.

jowl commented 9 years ago

And regarding your second point. Setup and cleanup are concepts within mappers and reducers, and I think calling it that might make it a bit confusing since it will be run in a completely different context.

jowl commented 9 years ago

I've now wrapped success in a RunResult, along with all the jobs so that @bjorne can extract whatever information he wants from them. This is an MVP for @bjorne's use case. I've also updated the PR description accordingly.
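A sketch of what that final shape could look like, with the success flag and the jobs wrapped together (field names are assumptions; only #success? and the jobs are described in the comment above):

```ruby
# Hypothetical final RunResult: wraps the success flag together with the
# completed jobs, so a callback can dig whatever information it needs out
# of the Job instances.
class RunResult
  attr_reader :jobs

  def initialize(success, jobs)
    @success = success
    @jobs = jobs
  end

  def success?
    @success
  end
end
```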

iconara commented 6 years ago

This was a nice idea, but given that we haven't merged it in two years I think it's better to close for now.