The filenames don't get escaped in output

I was running a job that writes its output to 'twoo/flowanalysis/2012/09/*', but this causes problems: when dumbo runs the HDFS move/remove operations (with overwrite="yes", for instance), it doesn't escape the path properly, and this results in an error.

See the output below:

12/09/06 14:00:00 INFO streaming.StreamJob: map 100% reduce 100%
12/09/06 14:00:32 INFO streaming.StreamJob: Job complete: job_201208201604_77368
12/09/06 14:00:32 INFO streaming.StreamJob: Output: twoo/flowanalysis/2012/09/pre1
Moved to trash: hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/pre1
Moved to trash: hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/04
EXEC: HADOOP_CLASSPATH="/home/jeroen/mm.metrics/jars/tusks.jar:$HADOOP_CLASSPATH" /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar -outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat' -inputformat 'org.apache.hadoop.streaming.AutoInputFormat' -mapper 'python -m base64_users map 1 629145600' -reducer 'python -m base64_users red 1 629145600' -numReduceTasks '60' -file '/home/poison/scripts/base64_users.py' -file '/usr/lib/dumbo/eggs/ctypedbytes-0.1.8-py2.6-linux-x86_64.egg' -file '/usr/lib/dumbo/lib/python2.6/site-packages/dumbo-0.21.34-py2.6.egg' -file '/usr/lib/dumbo/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg' -file '/home/jeroen/mm.metrics/jars/tusks.jar' -output 'twoo/flowanalysis/2012/09/' -jobconf 'stream.map.input=typedbytes' -jobconf 'stream.reduce.input=typedbytes' -jobconf 'stream.map.output=typedbytes' -jobconf 'stream.reduce.output=typedbytes' -jobconf 'mapred.job.name=base64_users.py (2/2)' -input 'twoo/flowanalysis/2012/09/__pre1' -cmdenv 'dumbo_mrbase_class=dumbo.backends.common.MapRedBase' -cmdenv 'dumbo_jk_class=dumbo.backends.common.JoinKey' -cmdenv 'dumbo_runinfo_class=dumbo.backends.streaming.StreamingRunInfo' -cmdenv 'PYTHON_EGG_CACHE=/tmp/eggcache' -cmdenv 'PYTHONPATH=ctypedbytes-0.1.8-py2.6-linux-x86_64.egg:dumbo-0.21.34-py2.6.egg:typedbytes-0.3.8-py2.6.egg'
12/09/06 14:00:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/home/poison/scripts/base64_users.py, /usr/lib/dumbo/eggs/ctypedbytes-0.1.8-py2.6-linux-x86_64.egg, /usr/lib/dumbo/lib/python2.6/site-packages/dumbo-0.21.34-py2.6.egg, /usr/lib/dumbo/lib/python2.6/site-packages/typedbytes-0.3.8-py2.6.egg, /home/jeroen/mm.metrics/jars/tusks.jar, /tmp/hadoop-poison/hadoop-unjar6857496811056060599/] [] /tmp/streamjob4826297620725394446.jar tmpDir=null
12/09/06 14:00:33 INFO mapred.JobClient: Cleaning up the staging area hdfs://hadoopname02/tmp/hadoop-mapred/mapred/staging/poison/.staging/job_201208201604_77492
12/09/06 14:00:33 ERROR security.UserGroupInformation: PriviledgedActionException as:poison (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/pre1 matches 0 files
12/09/06 14:00:33 ERROR streaming.StreamJob: Error Launching job : Input Pattern hdfs://hadoopname02/user/poison/twoo/flowanalysis/2012/09/pre1 matches 0 files
Streaming Command Failed!
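
To illustrate the kind of escaping I mean, here is a minimal sketch in Python. It is not dumbo's actual code: `build_rm_command` is a hypothetical helper, and I am only assuming that the cleanup step ends up as a `hadoop fs -rmr` shell command. It uses `shlex.quote` from Python 3; on Python 2.6 the equivalent would be `pipes.quote`. Note that this only covers shell-level quoting. Hadoop's own shell may still treat `*` as a glob pattern, which would need separate handling.

```python
import shlex


def build_rm_command(hadoop_bin, path):
    """Build a shell command string that removes `path` on HDFS.

    Hypothetical helper for illustration only; not part of dumbo.
    """
    args = [hadoop_bin, "fs", "-rmr", path]
    # shlex.quote leaves "safe" arguments untouched and wraps anything
    # containing shell metacharacters (such as * or spaces) in single quotes.
    return " ".join(shlex.quote(arg) for arg in args)


cmd = build_rm_command("/usr/lib/hadoop/bin/hadoop", "twoo/flowanalysis/2012/09/*")
print(cmd)
```

With quoting like this, the local shell at least passes the path through literally instead of trying to expand it.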

By the way, thanks a lot for this great contribution. I use it almost every day and it works like a charm. I really like using Hadoop streaming in Python!

Nicolas
