iconara / rubydoop

Write Hadoop jobs in JRuby

Use Puck for packaging #32

Closed: jowl closed this 9 years ago

jowl commented 9 years ago

This PR suggests replacing the built-in JAR packaging with Puck (https://github.com/iconara/puck).

Puck creates JARs that run bin-scripts as specified by the first command-line argument, which won't play very well with Hadoop, which is why I have proposed https://github.com/iconara/puck/pull/16. This PR is based on being able to specify a default entrypoint in Puck somehow (currently through setting a Java system property). EDIT: It works with Puck v1.1.0, but breaks backwards compatibility.

Moving to Puck causes a few things to change.

The main change for users is that the job configuration file, which previously lived in lib/my_job.rb, should now be bin/my_job, and that Rubydoop.configure has been replaced by Rubydoop.run. The latter exits on completion, so any code after it will never be run.
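
A minimal sketch of the move, assuming the block body itself stays the same (file names are just the example from above):

# before: lib/my_job.rb
Rubydoop.configure do |*args|
  # job definitions here
end

# after: bin/my_job -- Rubydoop.run exits once the jobs have finished,
# so anything after the block is never executed
Rubydoop.run do |*args|
  # job definitions here
end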

jowl commented 9 years ago

Updated description for using Puck v1.0.0.

iconara commented 9 years ago

At first I assumed that moving the JobRunner class to Ruby would mean that we would start JRuby containers in JRuby containers, but I realize now that that's not the case. That removes my reservations about moving the job setup code to Ruby.

jowl commented 9 years ago

I've changed the DSL a bit, and replaced Rubydoop.configure with Rubydoop.run. The idea is that rather than having a lib/app.rb, the user should create a bin/app containing a Rubydoop.run block. This would then allow her to start her job using hadoop jar app.jar app args....
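
For illustration, a hedged sketch of such a bin/app (the job name, paths, and classes are placeholders, and it assumes that the arguments after the script name are yielded to the block, as they were with Rubydoop.configure):

Rubydoop.run do |input_path, output_path|
  job 'app' do
    input input_path
    output output_path
    mapper App::Mapper
    reducer App::Reducer
  end
end

which would then be started with something like hadoop jar app.jar app path/to/input path/to/output.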

I've also realized that it doesn't work in a distributed setting; the reason is that the task nodes cannot find classes in rubydoop.jar, since it isn't on the classpath until after jar-bootstrap.rb has been loaded, and jar-bootstrap.rb isn't loaded until the InstanceContainer from rubydoop.jar attempts to resolve classes. I do, however, have two potential solutions, neither of which I've gotten to work:

I won't be able to work with this for a few weeks now, but feel free to do whatever with this in the meantime.

jowl commented 9 years ago

For the record, this is what I tried when attempting to manually put rubydoop.jar on the classpath:

# in JobRunner#run
require 'jruby' # for the JRuby module used below
rubydoop_path = JRuby.runtime.jruby_class_loader.get_resource('rubydoop').file # => file:/path/to/rubydoop.jar!/rubydoop
rubydoop_jar = rubydoop_path[/.*(?=!\/.*$)/] # strip the in-JAR suffix => file:/path/to/rubydoop.jar
# add the JAR to the distributed cache so the task nodes get it on their classpath
Java::OrgApacheHadoopFilecache::DistributedCache.add_archive_to_class_path(Java::OrgApacheHadoopFs::Path.new(rubydoop_jar), conf, Hadoop::Fs::FileSystem.get_local(conf))
iconara commented 9 years ago

I don't remember why Rubydoop.configure was renamed Rubydoop.run – I'm not against it, but we need to continue supporting Rubydoop.configure. There's also at least one place where the documentation still says Rubydoop.configure.

Is there anything else in this that would require a major version bump, or could we make the Rubydoop.configure/Rubydoop.run thing backwards compatible with an alias and release a new minor? Ich habe Hauptversionsnummernerhöhungsangst (major-version-bump anxiety), as you know.
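
For what it's worth, a minimal sketch of the alias idea (ignoring for the moment that configure used to only configure, while run also runs the jobs and exits):

module Rubydoop
  class << self
    # keep the old entrypoint name working by pointing it at the new one
    alias_method :configure, :run
  end
end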

jowl commented 9 years ago

The name was changed since there's no longer a DSL method that only configures a job; the only entrypoint now is Rubydoop.run, which configures and runs a job. I thought we agreed on breaking backwards compatibility and moving to a new major version. But maybe that was just me.

However, since you raised the question again, you got me thinking about whether it'd be possible to stay compatible or not. And that made me, regretfully, realize that this is a step away from what we discussed offline, i.e. making it easier to work with the configuration objects directly and only having the DSL as a convenience feature. In its current state, this PR builds even more things into the DSL and thus makes it even harder to do useful stuff without it. This is not a conclusion I wanted to come to. But then again, rather now than right after a major version bump.

What I'm saying is that I think I need to rework the interface a bit, to decouple the DSL from the actual objects, and upon doing that determine whether it'd make sense to support Rubydoop.configure. Which I'm guessing it will. Note that I have nothing against bumping the major version, and actually think that it's probably a good idea.

And no way you wrote that word without googling. Even @grddev couldn't pull that off.

jowl commented 9 years ago

Also, I realize that the documentation isn't up-to-date, but I'll make sure that it is as soon as we're able to determine a proper scope for this PR.

iconara commented 9 years ago

I have a keyboard shortcut for it.

iconara commented 9 years ago

I discovered something that probably needs to be handled: humboldt run-local will run hadoop jar path/to.jar -conf humbold/config.xml job_config job_arg1 job_arg2. This worked fine before since Hadoop's runner thing grabbed the -conf flag before any of our code saw it. Now Puck will see the arguments before the Hadoop code does, and it thinks -conf is the job to run. It's probably a simple fix in Humboldt, but it needs to be fixed.

jowl commented 9 years ago

Good catch.

iconara commented 9 years ago

Looks like just moving the arguments around so that the job name comes before -conf works.
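
A sketch of the reordering, with placeholder paths (this is not Humboldt's actual code):

# before (what humboldt run-local produced):
#   hadoop jar path/to.jar -conf humboldt/config.xml job_config job_arg1 job_arg2
# after (job name first, so Puck dispatches to the right bin script):
#   hadoop jar path/to.jar job_config -conf humboldt/config.xml job_arg1 job_arg2
def hadoop_jar_arguments(jar, job_config, conf_path, *job_args)
  ['jar', jar, job_config, '-conf', conf_path, *job_args]
end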

jowl commented 9 years ago

You could also add a binary called bin/-conf to Rubydoop, and have it rearrange the arguments and run the intended script :) And then do the same for every other possible option for hadoop.

jowl commented 9 years ago

These are the changes I've made since my realization yesterday and our offline discussion today: https://github.com/jowl/rubydoop/compare/5937d7a1f5e2e92deced3f658e1f47b246f0c133...puck

We discussed adding some kind of method for determining whether you're on the master node or not, but after reviewing the code again, I'm not sure where to put it. As it is right now JobRunner.run (and consequently Rubydoop.run) will do Java::JavaLang::System.exit, and it's hard to do anything useful after that.
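
To make that concrete, an illustrative sketch of the shape of the problem (not the actual implementation, and JobRunner.run's return value is an assumption):

module Rubydoop
  def self.run(&block)
    success = JobRunner.run(&block) # assumed to report overall job success
    Java::JavaLang::System.exit(success ? 0 : 1)
    # nothing below the exit, or after Rubydoop.run in a bin script, ever runs
  end
end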

iconara commented 9 years ago

Ok, maybe we need to add before and after blocks to the DSL instead (but we can do that in a minor release later).

jowl commented 9 years ago

I liked the callback idea and had a stab at implementing #after. The idea is that it could be used as

Rubydoop.run do

  # job setup here

  after do |success|
    if success
      # job was successful
    else
      # job failed
    end    
  end
end

That is, the block is run after all jobs have completed, which is why #before wouldn't make much sense at the moment; it would be about the same as putting the code before Rubydoop.run. With more involved callback mechanisms, e.g. per-job callbacks, it might make sense.

I thought about having #on_success and #on_failure instead, but figured I'd start with the simplest interface. But now that I look at it again, I'm starting to question that decision.
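
For comparison, the alternative could have looked something like this (illustrative only):

Rubydoop.run do

  # job setup here

  on_success do
    # job was successful
  end

  on_failure do
    # job failed
  end
end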

iconara commented 9 years ago

I think after makes more sense than on_success and on_failure, at least for the use cases I have in mind. After a job you want to send a message so that other parts of your processing pipeline can start chugging. You want to send different messages on success and on failure, but the setup will probably be the same. In other words: if you use SNS or SES, the setup is much more code than actually sending the message, so being able to do it all in one place, instead of duplicating the setup or breaking it out into a method, would be helpful.
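
A hedged sketch of that use case (the aws-sdk calls, region, and topic ARN are illustrative):

require 'aws-sdk'

Rubydoop.run do

  # job setup here

  after do |success|
    # the setup is shared, only the message differs
    sns = Aws::SNS::Client.new(region: 'eu-west-1')
    message = success ? 'job succeeded' : 'job failed'
    sns.publish(topic_arn: 'arn:aws:sns:eu-west-1:123456789012:pipeline-events', message: message)
  end
end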

iconara commented 9 years ago

That being said, let's hold off on after until this PR is merged. We can get it into 2.0, but we need to get this merged first.

jowl commented 9 years ago

Maybe #after is more convenient anyway. I definitely agree with you that #after shouldn't be in this PR, thanks for pointing that out clearly. It's always easy to add too much while you're at it.

jowl commented 9 years ago

I've removed the after callbacks from this PR now.