klbostee / dumbo

Python module that allows one to easily write and run Hadoop programs.
http://projects.dumbotics.com/dumbo

VirtualEnv + Hadoop CDH3B4 mode + Dumbo = import site error (was: StreamJob fails -- error finding typedbytes.pyc even though it exists) #28

Closed: brainstorm closed this issue 13 years ago

brainstorm commented 13 years ago

Hello,

I'm having the same issue described in this thread, but with the more recent CDH3B4:

http://groups.google.com/group/dumbo-user/browse_thread/thread/d5440880a5588278

Namely:

EXEC: HADOOP_CLASSPATH=":$HADOOP_CLASSPATH" $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -input '$HDFS_USERDIR/access.log' -output 'ipcounts' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -mapper 'python -m ipcount map 0 262144000' -reducer 'python -m ipcount red 0 262144000' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=ipcount.py (1/1)' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -cmdenv 'PYTHONPATH=common.pyc' -file '$HOME/ipcount.py' -file '$HOME/dumbo/dumbo/backends/common.pyc' -jobconf 'tmpfiles=$VIRTUALENV_INSTALL/typedbytes.pyc'
11/02/26 20:11:20 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [$HOME/ipcount.py, $HOME/dumbo/dumbo/backends/common.pyc, $HDFS_PATH//hadoop-unjar44840/] [] /tmp/streamjob44841.jar tmpDir=null
11/02/26 20:11:21 INFO mapred.JobClient: Cleaning up the staging area hdfs://$HOSTNAME/$HDFS_PATH//mapred/staging/romanvg/.staging/job_201102242242_0010
11/02/26 20:11:21 ERROR streaming.StreamJob: Error launching job , bad input path : File does not exist: $VIRTUALENV_INSTALL/typedbytes.pyc
Streaming Command Failed!

easy_install does not seem to be the reason: I've installed it via pip, easy_install, and now git clone. It seems to me that the 'tmpfiles' jobconf is what causes the problem.

Commenting out the offending code allows the MapReduce job to start, but it fails shortly after on the Hadoop side (not Dumbo):

diff --git a/dumbo/backends/streaming.py b/dumbo/backends/streaming.py
index 1f03b13..df2d3d5 100644
--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -180,15 +180,15 @@ class StreamingIteration(Iteration):
         hadenv = envdef('HADOOP_CLASSPATH', addedopts['libjar'], 'libjar', 
                         self.opts, shortcuts=dict(configopts('jars', self.prog)))
         fileopt = getopt(self.opts, 'file')
-        if fileopt:
-            tmpfiles = []
-            for file in fileopt:
-                if file.startswith('file://'):
-                    self.opts.append(('file', file[7:]))
-                else:
-                    tmpfiles.append(file)
-            if tmpfiles:
-                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
+#        if fileopt:
+#            tmpfiles = []
+#            for file in fileopt:
+#                if file.startswith('file://'):
+#                    self.opts.append(('file', file[7:]))
+#                else:
+#                    tmpfiles.append(file)
+#            if tmpfiles:
+#                self.opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
         libjaropt = getopt(self.opts, 'libjar')
         if libjaropt:
             tmpjars = []

The result of the above modification is:

11/02/26 20:23:32 INFO streaming.StreamJob: map 0% reduce 0%
11/02/26 20:23:40 INFO streaming.StreamJob: map 50% reduce 0%
11/02/26 20:23:41 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:02 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:06 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:13 INFO streaming.StreamJob: map 100% reduce 33%
11/02/26 20:24:17 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:26 INFO streaming.StreamJob: map 100% reduce 17%
11/02/26 20:24:29 INFO streaming.StreamJob: map 100% reduce 0%
11/02/26 20:24:32 INFO streaming.StreamJob: map 100% reduce 100%
(...)

11/02/26 20:24:32 ERROR streaming.StreamJob: Job not successful. Error: NA

brainstorm commented 13 years ago

The last "Job not successful failure" is closely related to the following post according to the hadoop job log:

http://www.curiousattemptbunny.com/2009/10/hadoop-streaming-javalangruntimeexcepti.html

Smells like virtualenv is causing trouble when running Dumbo (it cannot find the right version of Python on the worker nodes?).

I've tried hardcoding the shebang as the post suggests, but it didn't help :-S

brainstorm commented 13 years ago

The actual Hadoop exception is different from the one on the post:

ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102242242_0018_r_000000" TASK_ATTEMPT_ID="attempt_201102242242_0018_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298749520679" HOSTNAME="$HOST" ERROR="java\.lang\.RuntimeException: PipeMapRed\.waitOutputThreads(): subprocess failed with code 2
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:362)
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:572)
    at org\.apache\.hadoop\.streaming\.PipeReducer\.close(PipeReducer\.java:137)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.runOldReducer(ReduceTask\.java:478)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.run(ReduceTask\.java:416)
    at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:240)
    at java\.security\.AccessController\.doPrivileged(Native Method)
    at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
    at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1115)
    at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:234)

I am running this clustered Hadoop environment without root privileges (just as a regular user).

klbostee commented 13 years ago

The problem really is the way in which you install Dumbo -- it has to be installed as an egg (one that hasn't been unzipped into a directory). Commenting out the fileopt stuff hides the symptoms somewhat, but it definitely won't fix anything; if anything, it makes things worse.

When you start a Dumbo job, Dumbo will send itself along with the job by using the option "-file path_to_egg" internally, which won't work when it's not installed as an egg or when you disable the -file option (but the latter might indeed lead to less explicit errors, as you discovered).
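
To double-check your install, something along these lines (just a quick sketch of mine, not a Dumbo API) should tell you whether the dumbo you import actually lives inside a zipped egg:

# Not part of Dumbo: check whether the imported dumbo module comes from a zipped egg.
import os
import zipfile

import dumbo

path = os.path.abspath(dumbo.__file__)
if '.egg' in path:
    # Cut the path off right after the first ".egg" component.
    egg_path = path[:path.index('.egg') + len('.egg')]
    if zipfile.is_zipfile(egg_path):
        print('zipped egg (can be shipped with -file):', egg_path)
    else:
        print('unzipped egg directory (will not ship correctly):', egg_path)
else:
    print('not installed as an egg:', path)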

brainstorm commented 13 years ago

Thanks indeed! I just ran "python setup.py install" to generate an egg, and it works without commenting out the code, but it fails the same way on the Hadoop side:

ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201102262100_0002_r_000000" TASK_ATTEMPT_ID="attempt_201102262100_0002_r_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1298885499654" HOSTNAME="$HOSTNAME" ERROR="java\.lang\.RuntimeException: PipeMapRed\.waitOutputThreads(): subprocess failed with code 2
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.waitOutputThreads(PipeMapRed\.java:362)
    at org\.apache\.hadoop\.streaming\.PipeMapRed\.mapRedFinished(PipeMapRed\.java:572)
    at org\.apache\.hadoop\.streaming\.PipeReducer\.close(PipeReducer\.java:137)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.runOldReducer(ReduceTask\.java:478)
    at org\.apache\.hadoop\.mapred\.ReduceTask\.run(ReduceTask\.java:416)
    at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:240)
    at java\.security\.AccessController\.doPrivileged(Native Method)
    at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
    at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1115)
    at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:234)
" .

Any ideas why? Other Hadoop examples (the pi estimator) work fine :-S

klbostee commented 13 years ago

Sounds like a bug in your Dumbo script. The Hadoop Java exceptions are rarely useful in that case; you need to check the stderr logs instead (in the web UI, click on the job id -> the number of failed tasks -> last 4KB (under "logs")).

brainstorm commented 13 years ago

Yes, here it is:

stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)

I'm running dumbo as the tutorial states:

dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts

It must be something with my Python virtual environment not being able to import ipcount.py and the dumbo egg, then? Is ipcount.py supposed to be bundled in the Hadoop job somehow, like the dumbo egg?

klbostee commented 13 years ago

The ipcount.py script should be submitted along with the job as well (by adding "-file ipcount.py" under the hood). Are you sure you enabled all of the fileopt code again?

brainstorm commented 13 years ago

I removed every dumbo/typedbytes file/lib lying around in site-packages and re-installed the egg via "python setup.py install" (rolling back the commented-out lines), and it seems that the eggs and "ipcount.py" are passed to the job:

(...) -cmdenv 'PYTHONPATH=dumbo-0.21.30-py2.6.egg:typedbytes-0.3.6-py2.6.egg'
-file 'PATH_TO/ipcount.py'
-file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/dumbo-0.21.30-py2.6.egg'
-file 'PATH_TO.virtualenv/devel/lib/python2.6/site-packages/typedbytes-0.3.6-py2.6.egg'

Same result though:

/usr/bin/python: module ipcount not found

I tried hardcoding the shebang to point to my virtualenv's python, as the post suggests:

PATH_TO/.virtualenvs/devel/bin/python

But same effect on the Hadoop job:

/usr/bin/python: module ipcount not found

:-(

Thanks for your support!

brainstorm commented 13 years ago

I've been trying to adjust sys.path inside ipcount.py, but it still cannot find the ipcount(.py) module when running:

2011-03-01 13:26:07,241 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python, -m, ipcount, red, 0, 262144000]

Any further ideas?

stderr logs
/usr/bin/python: module ipcount not found
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
(...)
brainstorm commented 13 years ago

Now I've tried passing the "-pypath" and "-python" flags explicitly to dumbo:

$ dumbo start ipcount.py -hadoop $HADOOP_HOME -input access.log -output ipcounts -pypath '.:path/to/.virtualenvs/devel/lib/python2.6/site-packages' -python '/path/to/.virtualenvs/devel/bin/python'

The "." on pypath allows dumbo to find the ipcounts "module".

But now the error refers to the python importer:

'import site' failed; use -v for traceback
Could not import runpy module

I added the -v flag in dumbo/backends/common.py, but I couldn't see any clear clues as to why "site" does not get imported correctly...
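
In case it helps, a tiny diagnostic along these lines (purely a sketch of my own, not from dumbo), run with the virtualenv's python on a worker node, should show where that interpreter is looking for its standard library, which is usually what a failing "import site" points at:

# diag.py -- hypothetical helper; run as: /path/to/.virtualenvs/devel/bin/python diag.py
import sys

print(sys.executable)    # which interpreter actually ran
print(sys.prefix)        # where it expects its standard library
for p in sys.path:       # the module search path it ended up with
    print(p)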

Did you manage to get dumbo flying on virtualenv in -hadoop mode? From your post, it seems this was only tested in local mode:

http://dumbotics.com/2009/05/24/virtual-pythonenvironments/

What am I doing wrong? :-S

brainstorm commented 13 years ago

Moved the issue to dumbo-user mailing list:

http://groups.google.com/group/dumbo-user/t/c9d368625daa2629

dgleich commented 12 years ago

I fixed this issue with the following patch:

--- a/dumbo/backends/streaming.py
+++ b/dumbo/backends/streaming.py
@@ -76,7 +76,7 @@ class StreamingIteration(Iteration):
         if modpath.endswith('.egg'):
             addedopts.add('libegg', modpath)
         else:
-            opts.add('file', modpath)
+            opts.add('file', 'file://' + modpath)
         opts.add('jobconf', 'stream.map.input=typedbytes')
         opts.add('jobconf', 'stream.reduce.input=typedbytes')
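
My reading of why this helps (a sketch based on the fileopt code quoted near the top of this issue, not a verbatim copy of the source): StreamingIteration treats 'file://'-prefixed paths differently from bare ones. Prefixed paths are stripped and passed straight to Streaming as -file options, so Hadoop ships them from the local filesystem, while bare paths are collected into the 'tmpfiles' jobconf, where a scheme-less path gets resolved against the default (HDFS) filesystem -- which is what produced the original "bad input path" error. Roughly:

# Standalone sketch of that behaviour; the name build_file_opts is mine, not Dumbo's.
def build_file_opts(fileopt):
    opts, tmpfiles = [], []
    for f in fileopt:
        if f.startswith('file://'):
            # Shipped directly from the local filesystem via -file.
            opts.append(('file', f[7:]))
        else:
            # Left to Hadoop to resolve against the default filesystem (HDFS).
            tmpfiles.append(f)
    if tmpfiles:
        opts.append(('jobconf', 'tmpfiles=' + ','.join(tmpfiles)))
    return opts

print(build_file_opts(['file:///home/user/script.py', '/tmp/other.py']))
# [('file', '/home/user/script.py'), ('jobconf', 'tmpfiles=/tmp/other.py')]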