iconara / rubydoop

Write Hadoop jobs in JRuby

Can't get things working on EMR 5.0.3 #42

Open bwebster opened 7 years ago

bwebster commented 7 years ago

I'm prototyping a solution that uses Rubydoop on EMR. I have written some jobs that run fine locally, but when I try to execute them on EMR, I get the following error:

LoadError: no such file to load -- rubydoop
  require at org/jruby/RubyKernel.java:956
  require at uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:55
Exception in thread "main" org.jruby.embed.InvokeFailedException: (LoadError) no such file to load -- rubydoop
    at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.call(EmbedRubyObjectAdapterImpl.java:320)
    at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.callMethod(EmbedRubyObjectAdapterImpl.java:250)
    at org.jruby.embed.ScriptingContainer.callMethod(ScriptingContainer.java:1412)
    at rubydoop.InstanceContainer.getRuntime(InstanceContainer.java:30)
    at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:25)
    at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:18)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load -- rubydoop
    at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:956)
    at RUBY.require(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:55)

In terms of setup, I have run rake package and then uploaded the jar to an s3 bucket. When configuring my cluster step, I am able to select the custom jar. I then pass the following arguments: test abc123. test is the name of the job script (test.rb) and abc123 is the argument I want to pass.
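
The equivalent step added from the AWS CLI would look something like this (the cluster id and bucket name here are made up):

# hypothetical cluster id and bucket; Args are the job script name and its argument
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=RubydoopJob,Jar=s3://my-bucket/test.jar,Args=[test,abc123]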

The main class I'm trying to execute is lib/test.rb and looks like this:

require "rubydoop"
require "fileutils"
# ... other requires contained in lib/

Rubydoop.configure do |instance_id|
  # do stuff
end
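
For comparison, a filled-in configure block following the shape of the project's word count example would look something like this (the job name, paths, and classes are placeholders):

Rubydoop.configure do |input_path, output_path|
  job "word_count" do
    input input_path
    output output_path

    mapper WordCount::Mapper
    reducer WordCount::Reducer

    output_key Hadoop::Io::Text
    output_value Hadoop::Io::IntWritable
  end
end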

Any ideas?

iconara commented 7 years ago

Usually that error is indicative of some problem loading your code: a syntax error, a missing constant, etc. It gets reported, really confusingly, as a failure to find rubydoop, but it should be read as an error that was raised while rubydoop was loading (and it loads your code).
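
One way to surface the underlying error is to wrap the requires at the top of the job script and print whatever is raised before re-raising; a minimal sketch (the required file name is hypothetical):

begin
  require "my_job_code"
rescue Exception => e
  # rescue Exception (not StandardError) so SyntaxError is caught too;
  # print the real error before it gets re-reported as a LoadError
  warn "#{e.class}: #{e.message}"
  warn e.backtrace.join("\n") if e.backtrace
  raise
end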

You could try running locally with -Djruby.log.exceptions=true -Djruby.log.backtraces=true set to see all Ruby errors. There will be lots and lots of them, but keep an eye out for any that look like they could come from loading your code.
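
One way to make sure those flags reach the client JVM when launching through the hadoop script should be HADOOP_OPTS, e.g.:

HADOOP_OPTS="-Djruby.log.exceptions=true -Djruby.log.backtraces=true" hadoop jar build/test.jar test abc123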

bwebster commented 7 years ago

Thanks for the quick feedback. I'll give that a shot today and see what I can find.

The odd thing is that it works fine when running against the same version of Hadoop installed locally via brew.

bwebster commented 7 years ago

Sorry, I'm a bit rusty with Hadoop. I've tried all of the following when running locally, and I'm not seeing any errors or traces (I was expecting to see a lot, based on your comment):

hadoop jar build/test.jar -D mapred.child.java.opts="-Druby.log.exceptions=true -Druby.log.backtraces=true" test abc123

hadoop jar build/test.jar -Druby.log.exceptions=true -Druby.log.backtraces=true test abc123

hadoop jar build/test.jar test abc123 -Druby.log.exceptions=true -Druby.log.backtraces=true

iconara commented 7 years ago

The last command is the right form, but the flags should be -Djruby…, not -Druby….
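
With the typo fixed, that last command would be:

hadoop jar build/test.jar test abc123 -Djruby.log.exceptions=true -Djruby.log.backtraces=true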

I ran this past my colleagues, but none of them have tried EMR 5, so unfortunately we don't have any experience to draw on if it turns out to be an EMR 5 issue.

bwebster commented 7 years ago

Good catch on the -Djruby fail. I've fixed that.

I've narrowed my code down to something very simple. I have a lib/test.rb file, which looks like this:

puts "Running test"

I then run rake package, upload the jar to s3, and add a step using that custom jar, passing the following options:

test 
-Djruby.log.exceptions=true 
-Djruby.log.backtraces=true

Even with that very simple setup, I'm still getting:

LoadError: no such file to load -- rubydoop
  require at org/jruby/RubyKernel.java:956
  require at uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:55
Exception in thread "main" org.jruby.embed.InvokeFailedException: (LoadError) no such file to load -- rubydoop
    at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.call(EmbedRubyObjectAdapterImpl.java:320)
    at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.callMethod(EmbedRubyObjectAdapterImpl.java:250)
    at org.jruby.embed.ScriptingContainer.callMethod(ScriptingContainer.java:1412)
    at rubydoop.InstanceContainer.getRuntime(InstanceContainer.java:30)
    at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:25)
    at rubydoop.RubydoopJobRunner.run(RubydoopJobRunner.java:18)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at rubydoop.RubydoopJobRunner.main(RubydoopJobRunner.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.jruby.exceptions.RaiseException: (LoadError) no such file to load -- rubydoop
    at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:956)
    at RUBY.require(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rubygems/core_ext/kernel_require.rb:55)
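
One thing that might be worth checking, to rule out a packaging problem, is whether the rubydoop gem, the JRuby runtime, and the job script actually made it into the jar, e.g.:

jar tf build/test.jar | grep -i -E 'rubydoop|jruby|test\.rb'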

What version of EMR are you successfully using? Are there any other techniques you are using to get better diagnostic info?

Thanks for the help.

iconara commented 7 years ago

We run Rubydoop-based jobs on EMR 3.9.0 and 4.2.0. It's not that we haven't gotten it to work on 5; it's more that we haven't built a new job since 5 was released and haven't had a reason to test it.

Which version of Rubydoop are you using, and which version of JRuby?

bwebster commented 7 years ago

Here is my setup. I'm going to try running on 4.2.0 to see if that gets things going.

I've been putting my jar in s3, and all my input files are in s3 as well. A couple of questions about that:

Gemfile

gem "rubydoop", "1.2.1"

group :development do
  gem "rake"
  gem "jruby-jars", "= 9.1.5.0"
end

.ruby-version

jruby-9.1.5.0

Java

java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

JRuby

jruby 9.1.5.0 (2.3.1) 2016-09-07 036ce39 Java HotSpot(TM) 64-Bit Server VM 25.45-b02 on 1.8.0_45-b14 +jit [darwin-x86_64]

Hadoop

Hadoop 2.7.2

bwebster commented 7 years ago

I've taken a step back and created a new project that follows the word count example in the README for v1.2.1.

I'm going to try and get that working on EMR 4.2.0 and go from there.
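
For reference, the mapper and reducer in that README example look roughly like this (details are approximate, recalled from the v1.2.x README):

module WordCount
  class Mapper
    def map(key, value, context)
      value.to_s.split.each do |word|
        context.write(Hadoop::Io::Text.new(word.downcase), Hadoop::Io::IntWritable.new(1))
      end
    end
  end

  class Reducer
    def reduce(key, values, context)
      sum = 0
      values.each { |value| sum += value.get }
      context.write(key, Hadoop::Io::IntWritable.new(sum))
    end
  end
end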

enifsieus commented 6 years ago

This issue is fairly old, but I'm running into the same thing. I'm modernizing a project that is several years old to use the latest AWS SDKs, the latest EMR release, and JRuby 9.1.16.0 (it previously ran on 1.7.20). I've isolated the cause to the JRuby upgrade: 9.1 passes our specs, but does not pass the rubydoop specs (on either master or the v1.2.x branch) and produces this error in the specs as well as on AWS.

iconara commented 6 years ago

@enifsieus Sorry to hear that. I think I would need some help to get Rubydoop up to date with newer Hadoop and JRuby versions; I have mostly moved away from Hadoop and only have some legacy jobs that still use Rubydoop.