infochimps-labs / wukong-hadoop

Execute Wukong code within the Hadoop framework.

local mode works but not hadoop mode #2

Open ScotterC opened 11 years ago

ScotterC commented 11 years ago

Not sure if this is a full-fledged issue, but I'm posting it here because the Google group doesn't appear to be very active.

In short, the examples work fine in local mode but not hadoop mode.

This works as expected:

wu-hadoop examples/word_count.rb --mode=local --input=examples/sonnet_18.txt

However, when I switch it over to the single-node Hadoop cluster running locally, it fails.

I've put sonnet_18.txt into HDFS, and normal hadoop jar examples such as the pi calculation work fine.
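
For reference, staging the input amounts to something like this (target path as in the command below):

hadoop fs -mkdir /user/ScotterC
hadoop fs -put examples/sonnet_18.txt /user/ScotterC/sonnet_18.txt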

Command:

wu-hadoop examples/word_count.rb --mode=hadoop --input=/user/ScotterC/sonnet_18.txt --output=/user/ScotterC/word_count.tsv --rm

I get

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

and the job details page shows:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

This is with Hadoop 1.1.1, wukong 3.0.0.pre3, and wukong-hadoop 0.0.2.

If anyone has pointers for debugging Java it would be highly appreciated.

1st EDIT:

Found that further down the stack trace there is:

Caused by: java.io.IOException: Cannot run program "wu-local": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)

I'm using RVM, so my wu-local is located at /Users/ScotterC/.rvm/gems/ruby-1.9.3-p194/bin/wu-local.
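
For reference, the mismatch is easy to see from a shell (the env -i check is just a rough stand-in for the bare environment a Hadoop child task gets):

which wu-local
  # => /Users/ScotterC/.rvm/gems/ruby-1.9.3-p194/bin/wu-local
env -i /bin/sh -c 'command -v wu-local'
  # => nothing: it isn't on a default PATH, so Hadoop's child process can't exec it either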

Writing out full paths for wu-local got me to my next error

wu-hadoop examples/word_count.rb --mode=hadoop --input=sonnet_18.txt --output=word_count.tsv --rm --map_command='/Users/ScotterC/.rvm/gems/ruby-1.9.3-p194/bin/wu-local /Users/ScotterC/code/github/wukong-hadoop/examples/word_count.rb --run=mapper' --reduce_command='/Users/ScotterC/.rvm/gems/ruby-1.9.3-p194/bin/wu-local /Users/ScotterC/code/github/wukong-hadoop/examples/word_count.rb --run=reducer'

Error:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

2nd EDIT:

Digging through Hadoop job logs gives me

env: ruby_noexec_wrapper: No such file or directory

So I'm wondering why it can't find my Ruby installation. I assume that's what's causing the subprocess failure.
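
For context, RVM rewrites its gem binstubs with a wrapper shebang, so on a setup like this the first line of wu-local typically looks like:

head -1 /Users/ScotterC/.rvm/gems/ruby-1.9.3-p194/bin/wu-local
  # => #!/usr/bin/env ruby_noexec_wrapper

ruby_noexec_wrapper lives in RVM's own bin directories, so unless those are on the PATH Hadoop hands its child tasks, env can't find it and the subprocess exits with code 127, matching the error above.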

ScotterC commented 11 years ago

I believe this is related to https://github.com/infochimps-labs/wukong/issues/11.

Wukong::Hadoop::HadoopInvocation#ruby_interpreter_path would appear to be setting the right ruby path but I can't find where that method is called.

The only other clue I have is that my stderr job logs also show "Unable to load realm info from SCDynamicStore", which I believe was fixed with the right export in hadoop-env.sh and has stopped showing in my stdout, so maybe the Hadoop being launched is loading a different environment.
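
(For reference, the usual hadoop-env.sh workaround for that SCDynamicStore warning on OS X is an export along these lines:)

export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="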

ScotterC commented 11 years ago

I'm becoming more convinced that the issue lies somewhere in how wu-hadoop loads the environment. I wrote some plain Ruby scripts and ran them through Hadoop streaming:

hadoop jar /usr/local/Cellar/hadoop/1.1.1/libexec/contrib/streaming/hadoop-*streaming*.jar \
  -file /Users/ScotterC/disco/hadoop-ruby/mapper.rb    -mapper /Users/ScotterC/disco/hadoop-ruby/mapper.rb \
  -file /Users/ScotterC/disco/hadoop-ruby/reducer.rb   -reducer /Users/ScotterC/disco/hadoop-ruby/reducer.rb \
  -input sonnet_18.txt -output word_count_ruby.tsv

These worked fine, so I'm guessing that with normal wu-hadoop the Gemfile is somehow not getting loaded or is getting unset.
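
Worth noting: for streaming scripts invoked directly like this, each file generally needs a ruby shebang and the executable bit, e.g.:

head -1 /Users/ScotterC/disco/hadoop-ruby/mapper.rb
  # => #!/usr/bin/env ruby
chmod +x /Users/ScotterC/disco/hadoop-ruby/mapper.rb /Users/ScotterC/disco/hadoop-ruby/reducer.rb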

BTW, sorry for the stream-of-consciousness issue posting, but I'm hoping this will be useful to others hitting the same issues and will make good Google fodder.

mrflip commented 11 years ago

Are you running on the 1.0-ish branch of Hadoop (CDH3 / 0.20, etc.) or the 2.0-ish branch (CDH4)?

Does Hadoop streaming work for you at all?

What do the log files from the child process say? (You get those by clicking through the tasks in the JobTracker UI, or by drilling into the non-world-readable dirs in /var/log/hadoop.)
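
On a single-node 1.x install the task attempt logs usually land somewhere like the path below (job and attempt ids are placeholders):

ls /var/log/hadoop/userlogs/job_201302071234_0001/attempt_201302071234_0001_m_000000_0/
  # stderr  stdout  syslog  (stderr is where the Ruby/RVM errors show up)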

Sent from my iPad


ScotterC commented 11 years ago

Just figured it out! Of course in the end it was simple.

Using Hadoop 1.1.1. I had [[ -s "/Users/ScotterC/.rvm/scripts/rvm" ]] && source "/Users/ScotterC/.rvm/scripts/rvm" in my hadoop-env.sh, but I guess I must not have restarted Hadoop after adding it. The other factor was needing a Gemfile in the folder so that wukong would pick it up and set cmdenv.
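
Concretely, the combination that worked comes down to something like this (RVM path as above; the restart uses the stock Hadoop 1.x scripts):

# appended to $HADOOP_HOME/conf/hadoop-env.sh
[[ -s "/Users/ScotterC/.rvm/scripts/rvm" ]] && source "/Users/ScotterC/.rvm/scripts/rvm"

# restart so the daemons pick up the new environment
stop-all.sh && start-all.sh

# plus a Gemfile sitting next to the script, so wukong picks it up and sets cmdenv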

After adding the RVM script to the end of my Hadoop bash environment, it appears to be working. What are your thoughts on throwing a warning if the Gemfile isn't there? I guess it wouldn't make sense if wukong were installed globally. Not sure if you guys use RVM much, but I'd be happy to add a note to the docs saying 'if you're using RVM, make sure to add it to hadoop-env.sh and restart the cluster.'

mrflip commented 11 years ago

tl;dr: The -file stuff is really dangerous, and I am going to check why it was made the default. It will only work for an utterly trivial example and fail mysteriously otherwise.

Design parameters as I see them:

  1. Somehow or another, you have to put all files the script will use in a place where the hadoop runner will see them:
    • in a fixed location on each computer, set up identically
    • in the job-local directory, by enumerating with the -file flag
    • I think you can also specify a jar file, using I-forget-what hadoop commandline flag, to be un-jar'ed
    • on a shared network drive eg NFS
  2. If you are developing a script:
    • you must not have to manually re-enumerate every inclusion for the runner to find. That is, I must be free to say "require_relative '../../crazy/unexpected/path/to/thing'" or "File.open(File.expand_path('../../data/table.tsv', __FILE__))" and not have to ALSO go list those paths in a separate manifest.
    • you must not have to manually synchronize things across computers in order to launch
    • you should not have to watch as a time-consuming package/upload/eventually unpackage step is added to each job launch
    • you should be allowed to have paths that lie outside the local app directory tree
  3. If you are running in production:
    • you must be confident that all and only what you expect to be running is running.

The only two workable solutions I'm aware of are:

Specifying -file fails as soon as anything besides my script is launched (in this case, the Gemfile, but any file I required, or data file I load, or local checkout of a library, or...).
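
To sketch the failure mode (the streaming jar path and the require below are illustrative): -file copies exactly the files you enumerate into the task's working directory, and nothing else.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
  -file word_count.rb \
  -mapper 'wu-local word_count.rb --run=mapper' \
  -reducer 'wu-local word_count.rb --run=reducer' \
  -input sonnet_18.txt -output word_count.tsv
# word_count.rb shows up in the task's working directory, but a require_relative '../lib/helpers'
# inside it, or the Gemfile sitting next to it, never does; the task dies at load time.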

@dhruvbansal would you please revert out that behavior?

ScotterC commented 11 years ago

Thanks for the write-up. Since I was quite new to map/reduce, I wanted to get it all working locally on my MacBook before moving to a distributed setting, but of course that means dealing with the idiosyncrasies of my workstation, as opposed to using Chef recipes and getting it right the first time in a controlled environment. I did learn a lot the hard way, though :)

If it's your intention to let more developers get up and running with wukong quickly, then designing and implementing an opinionated packager could be a priority, or possibly just some very comprehensive tutorial docs. An alternative could be a short doc on how to use ironfan to set up a Vagrant instance, so there would at least be a controlled environment. However, considering all the work involved in just making an awesome Ruby wrapper to quickly code data flows and deploy them to a cluster, making sure it works perfectly on a local Hadoop cluster should probably not be a priority.

dhruvbansal commented 11 years ago

@mrflip I'd love to have the opinionated packager option working but at present only the network drive approach is really feasible. wu-hadoop should work "as expected" if

@ScotterC did one of the above "supported" use-cases not work for you?

I'd like a more robust model, but this is what we've got today. Agreed with @ScotterC that these constraints deserve better documentation.

ScotterC commented 11 years ago

@dhruvbansal I guess the supported case that ended up working for me was hadoop mode without an NFS, munging the path loading to make sure everything needed was there. It was just very confusing, thanks to the finicky nature of Hadoop configuration and the difficulty of fully understanding how it loads its environment. I believe the process will be much smoother when set up over a network.

Local mode, however, is a breeze, and that's really the strength of the library: it lets you test code before putting it out there and wasting compute cycles. I'm now going to move on to deploying a deploy-pack with ironfan, which appears to be fairly well documented here. I'll let you know if I run into difficulties.