iconara / rubydoop

Write Hadoop jobs in JRuby
220 stars 33 forks source link

Unique output paths #26

Closed jowl closed 9 years ago

jowl commented 9 years ago

I've managed to have Hadoop jobs fail due to using the same intermediate path for multiple steps on more than one occasion, and have thought a bit about how to avoid it. I'm not sure how much this particular solution actually helps with the problem, but I thought it might at least serve as a decent start for a discussion.

The idea is that you should be able to configure jobs with multiple steps like:

Rubydoop.configure do |input_path, output_path|
  first_step = job 'first-step' do
    input input_path
    output intermediate: true
    # rest of definition
  end

  second_step = job 'second-step' do
    input first_step.output
    output intermediate: true
    # rest of definition
  end

  job 'third-step' do
    input second_step.output
    output output_path
    # rest of definition
  end
end

The argument handling in JobDefinition#output is rather hacky, as it can be used as both a getter and a setter, and with a semi-optional dir argument. But I figured that it is better to make it complex than to make using it complex. Then again, I wouldn't mind it not being so hacky.

iconara commented 9 years ago

I kind of like the idea, but I'm a bit torn about whether or not I want this in Rubydoop. I'd like to get rid of the job definition DSL completely, so adding to it might not be right. On the other hand, this feature would probably fit with whatever it's replaced by.

grddev commented 9 years ago

I think the gain over something like the following would be quite small. Given that the DSL should be reworked, I'm not sure the added functionality makes sense.

  second_tmp_path = Dir::Tmpname.make_tmpname('second', nil)
  job 'second-step' do
    input first_path
    output second_tmp_path
  end

  job 'third-step' do
    input second_tmp_path
    output output_path
  end
jowl commented 9 years ago

I thought about the same thing as well, but I like my solution a bit better for two reasons; firstly, it hides away the "unique" path generation from the user, and secondly, it makes it a little bit harder to make mistakes, someone might for instance use another variable for the output path of the second step in your example without changing the definition of the third step.

Having that said, I don't know whether this should be merged or not, given that the DSL may be removed soon. But as @iconara said, this should fit with a non-DSL based interface as well.

iconara commented 9 years ago

What's the consensus of this? I don't mind including this in a v1.2 release. A v2.0 with a new DSL isn't exactly round the corner, so if this helps people I'm not against it.

iconara commented 9 years ago

Merged in 3d3c672ebb21638b1e0548ae429c7e589e5cf654. I also update the tests since #28 upgraded RSpec (see dd90f1a385bba7d5ea74278238e91aadc9b1a5e0).