Updated description for using Puck v1.0.0.
At first I assumed that moving the JobRunner class to Ruby would mean that we would start JRuby containers in JRuby containers, but I realize now that that's not the case. That removes my reservations about moving the job setup code to Ruby.

I've changed the DSL a bit, and replaced Rubydoop.configure with Rubydoop.run. The idea is that rather than having a lib/app.rb, the user should create a bin/app containing a Rubydoop.run block. This would then allow her to start her job using hadoop jar app.jar app args....
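To make that concrete, here is a rough sketch of what such a bin/app could look like (the WordCount job and the DSL details are just illustrative assumptions, not necessarily what ends up in the PR):

```ruby
#!/usr/bin/env ruby

# Hypothetical bin/app -- a sketch only. Assumes the job DSL
# (job/input/output/mapper/reducer) works as before and that the
# WordCount mapper/reducer classes live under lib/.
require 'word_count'

Rubydoop.run do |input_path, output_path|
  job 'word_count' do
    input input_path
    output output_path
    mapper WordCount::Mapper
    reducer WordCount::Reducer
  end
end

# Started with: hadoop jar app.jar app input/path output/path
```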
I've also realized that it doesn't work in a distributed setting; the reason for this is that the task nodes cannot find classes in rubydoop.jar, as that isn't on the classpath until after jar-bootstrap.rb has been loaded, and the latter isn't loaded until the InstanceContainer from rubydoop.jar attempts to resolve classes. I do, however, have two potential solutions, neither of which I've gotten to work:

1. Putting the contents of rubydoop.jar at the root of the main JAR file. I haven't tried this, but I can't see why it wouldn't work.
2. Adding rubydoop.jar to the classpath from the JobRunner. On old Hadoop versions this could be achieved with DistributedCache.addArchiveToClassPath and on newer ones with Job#addArchiveToClassPath. I did try this, but wasn't able to get the paths resolvable from both master and task nodes. It feels like it should be possible though.

I won't be able to work on this for a few weeks now, but feel free to do whatever with this in the meantime.
For the record, this is what I tried when attempting to manually put rubydoop.jar on the classpath:
```ruby
# in JobRunner#run
rubydoop_path = JRuby.runtime.jruby_class_loader.get_resource('rubydoop').file # => file:/path/to/rubydoop.jar!/rubydoop
rubydoop_jar = rubydoop_path[/.*(?=!\/.*$)/] # => file:/path/to/rubydoop.jar
Java::OrgApacheHadoopFilecache::DistributedCache.add_archive_to_class_path(
  Java::OrgApacheHadoopFs::Path.new(rubydoop_jar),
  conf,
  Hadoop::Fs::FileSystem.get_local(conf)
)
```
I don't remember why Rubydoop.configure was renamed Rubydoop.run – I'm not against it, but we need to continue supporting Rubydoop.configure. There's also at least one place where the documentation still says Rubydoop.configure.
Is there anything else in this that would require a major version bump, or could we make the Rubydoop.configure/Rubydoop.run thing backwards compatible with an alias and release a new minor? Ich habe Hauptversionsnummernerhöhungsangst (major-version-bump anxiety), as you know.
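For reference, a minimal sketch of what such an alias could look like (purely illustrative; whether run's new run-and-exit semantics make it a drop-in replacement for configure is exactly the open question):

```ruby
module Rubydoop
  # Hypothetical backwards-compatibility shim: keep the old name around
  # and delegate to the new entrypoint. Note that Rubydoop.run also runs
  # the jobs and exits, which the old configure did not do.
  def self.configure(&block)
    warn 'Rubydoop.configure is deprecated, use Rubydoop.run instead'
    run(&block)
  end
end
```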
The name was changed since there's no longer a DSL method that only configures a job; the only entrypoint now is Rubydoop.run, which configures and runs a job. I thought we agreed on breaking backwards compatibility and moving to a new major version. But maybe that was just me.
However, since you raised the question again you got me thinking about whether it'd be possible to stay compatible or not. And that made me, regrettably, realize that this is a step away from what we discussed offline, i.e. making it easier to work with the configuration objects directly and only having the DSL as a convenience feature. In its current state, this PR builds even more things into the DSL and thus makes it even harder to do useful stuff without it. This is not a conclusion I wanted to come to. But then again, rather now than right after a major version bump.
What I'm saying is that I think I need to rework the interface a bit, to decouple the DSL from the actual objects, and upon doing that determine whether it'd make sense to support Rubydoop.configure. Which I'm guessing it will. Note that I have nothing against bumping the major version, and actually think that it's probably a good idea.
And no way you wrote that word without googling. Even @grddev couldn't pull that off.
Also, I realize that the documentation isn't up-to-date, but I'll make sure that it is as soon as we're able to determine a proper scope for this PR.
I have a keyboard shortcut for it.
I discovered something that probably needs to be handled: humboldt run-local will run hadoop jar path/to.jar -conf humbold/config.xml job_config job_arg1 job_arg2. This worked fine before since Hadoop's runner grabbed the -conf flag before any of our code saw it. Now Puck will see the arguments before the Hadoop code does, and it thinks -conf is the job to run. It's probably a simple fix in Humboldt, but it needs to be fixed.
Good catch.
Looks like just moving the arguments around so that the job name comes before -conf works.
You could also add a binary called bin/-conf to Rubydoop, and have it rearrange the arguments and run the intended script :) And then do the same for every other possible option for hadoop.
These are the changes I've made since my realization yesterday and our offline discussion today: https://github.com/jowl/rubydoop/compare/5937d7a1f5e2e92deced3f658e1f47b246f0c133...puck
We discussed adding some kind of method for determining whether you're on the master node or not, but after reviewing the code again, I'm not sure where to put it. As it is right now JobRunner.run (and consequently Rubydoop.run) will do Java::JavaLang::System.exit, and it's hard to do anything useful after that.
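A sketch of the consequence (notify_pipeline is just a placeholder for whatever master-node-only code you'd want to run):

```ruby
Rubydoop.run do
  # job setup here
end

# Never reached: Rubydoop.run ends with Java::JavaLang::System.exit once
# the jobs have finished, so there's nowhere after it to hang
# "am I on the master node?" style logic.
notify_pipeline
```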
Ok, maybe we need to add before and after blocks to the DSL instead (but we can do that in a minor release later).
I liked the callback idea and had a stab at implementing #after. The idea is that it could be used as:
```ruby
Rubydoop.run do
  # job setup here
  after do |success|
    if success
      # job was successful
    else
      # job failed
    end
  end
end
```
That is, the block is run after all jobs are completed, which is why #before wouldn't make much sense atm. That would be about the same as putting the code before Rubydoop.run. With more involved callback mechanisms, e.g. with per-job callbacks, it might make sense.
I thought about having #on_success and #on_failure instead, but figured I'd start with the simplest interface. But now that I look at it again, I'm starting to question that decision.
I think after makes more sense than on_success and on_failure, at least for the use cases I have in mind. After a job you want to send a message so that other parts of your processing pipeline can start chugging. You want to send different messages on success and on failure, but the setup will probably be the same. In other words: if you use SNS or SES the setup is much more than sending the message, so being able to do that in one place instead of having to duplicate the setup or break it out into a method would be helpful.
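For illustration, a rough sketch of the kind of after block I have in mind (the aws-sdk SNS client and topic ARN are hypothetical and not part of this PR):

```ruby
require 'aws-sdk-sns' # assumption: the aws-sdk SNS gem is available in the job

Rubydoop.run do
  # job setup here

  after do |success|
    # The setup is the same regardless of outcome...
    sns = Aws::SNS::Client.new(region: 'eu-west-1')
    topic_arn = 'arn:aws:sns:eu-west-1:123456789012:pipeline-events' # hypothetical

    # ...only the message differs between success and failure.
    message = success ? 'jobs completed' : 'jobs failed'
    sns.publish(topic_arn: topic_arn, message: message)
  end
end
```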
That being said, let's wait with after until this PR is merged. We can get it into 2.0, but we need to get this merged first.
Maybe #after is more convenient anyway. I definitely agree with you that #after shouldn't be in this PR; thanks for pointing it out clearly. It's always easy to add too much while at it.
I've removed the after callbacks from this PR now.
This PR suggests replacing the built-in JAR packaging with Puck (https://github.com/iconara/puck).
Puck creates JARs that run bin-scripts as specified by the first command-line argument, which won't play very well with Hadoop, which is why I have proposed https://github.com/iconara/puck/pull/16. This PR is based on being able to specify a default entrypoint in Puck somehow (currently through setting a Java system property).

EDIT: It works with Puck v1.1.0, but breaks backwards compatibility.

Moving to Puck causes a few things to change, primarily:

- The meaning of :jar_path changes to "relative path of the JAR file within the build directory".
- Jobs are started with hadoop jar app.jar app_name ....
- jruby-complete.jar still works, but Puck adds support for using the jruby-jars gem.

The main change for users is that the job configuration file, which was previously in lib/my_job.rb, should now be in bin/my_job, and that Rubydoop.configure has been replaced by Rubydoop.run. The latter will exit on completion, so any code after it will never be run.
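As a concrete (hypothetical) sketch of that migration, assuming an existing job definition:

```ruby
# Before: lib/my_job.rb
Rubydoop.configure do |input_path, output_path|
  job 'my_job' do
    # mapper/reducer/input/output configuration as before
  end
end

# After: bin/my_job (configures, runs the jobs, and exits when they're done)
Rubydoop.run do |input_path, output_path|
  job 'my_job' do
    # mapper/reducer/input/output configuration as before
  end
end
```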