neuralinfo opened this issue 9 years ago
I'm getting an error when running EMR; I've tried this over 10 times and it keeps failing.
Is there something I'm doing wrong?
The stderr logfile shows:
Exception in thread "main" java.lang.IllegalArgumentException: The bucketName parameter must be specified.
    at com.amazonaws.services.s3.AmazonS3Client.rejectNull(AmazonS3Client.java:2917)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1074)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1042)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:105)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:86)
    ......
These are the settings:
JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar
Main class: None
Arguments: -files s3://hw.w205.assignment4.1/code/wMapOccur.py,s3://hw.w205.assignment4.1/code/wReducerOccur.py -mapper wMapOccur.py -reducer wReducerOccur.py -input s3://hw.w205.assignment4.1/input/WC2015.csv -output s3://hw.w205.assignment4.1/output/job0001
Action on failure: Terminate cluster
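For reference, the mapper and reducer are plain Hadoop streaming scripts along these lines (a simplified sketch of the word-occurrence count, not the exact contents of wMapOccur.py / wReducerOccur.py):

#!/usr/bin/env python
# wMapOccur.py (sketch): read lines from stdin, emit one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# wReducerOccur.py (sketch): sum the counts per word; streaming feeds the reducer
# lines already sorted by key
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))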
Does job0001 already exist in your output? If yes, you need to delete it before launching your cluster, or change the parameter to job0002 in your setup.
job0001 does not exist in the output folder. I tried changing the parameter to job0002, but that doesn't work either. I can't seem to figure out the issue.
Are you on Windows? If yes, make sure you do not have any non-ASCII characters in the bucket names, and check whether other users have permission to access your bucket. If you are on a Mac, have you checked the permissions on your bucket? It needs to have read/write permission.
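One quick way to sanity-check access from a script (a rough sketch using boto3; the bucket name is just a placeholder, and credentials/region come from your normal AWS config):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
bucket = 'hw.w205.assignment4.1'  # placeholder: put your bucket name here

try:
    s3.head_bucket(Bucket=bucket)            # fails if the bucket doesn't exist or you lack access
    acl = s3.get_bucket_acl(Bucket=bucket)   # lists who has READ/WRITE grants
    for grant in acl['Grants']:
        grantee = grant['Grantee'].get('DisplayName', grant['Grantee'].get('URI'))
        print(grantee, grant['Permission'])
except ClientError as e:
    print('Cannot access bucket:', e)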
I'm using MrJob, and so far I've been able to successfully run all the EMR jobs when I use a single Python script. However, to create more manageable code for my assignment, I decided to split out some common code I use across my MR programs into an additional utils.py file. I just want to be able to import this from my main Python script (let's call it program.py) by calling this:
from utils import *
I have added the [--file] option and specified the utils.py file in the command line when I run MrJob as shown below:
python program.py -r emr s3://jvl-mids-w205-assignment4/input/tweets.csv --output-dir=s3://jvl-mids-w205-assignment4/output/job/ --no-output --conf-path mrjob.conf --file utils.py
Unfortunately, I get the following error in EMR:
Traceback (most recent call last):
File "program1.py", line 6, in <module>
from utils import *
ImportError: No module named utils
The interesting thing is that both program.py and utils.py are in the same folder in S3, so I'm not sure why the utils.py file is not recognized. Any idea what I'm missing? I found something online related to this, but I don't really understand the answer, so any help is appreciated.
http://stackoverflow.com/questions/26812124/hadoop-streaming-with-multiple-python-files
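For reference, program.py is basically an MRJob job that pulls a couple of helpers out of utils.py, roughly like this (a simplified sketch; tokenize is just a stand-in name for whatever the shared helper actually is):

# program.py (sketch)
from mrjob.job import MRJob
from utils import *  # shared helpers, e.g. a tokenize() function

class Program(MRJob):

    def mapper(self, _, line):
        # tokenize() lives in utils.py; the name here is illustrative only
        for word in tokenize(line):
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    Program.run()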
Does mrjob read from S3, though? I think for --file you still might need an S3 address; otherwise it might be thinking utils.py is within the cc-job directory.
@JaimeVL: you can add the files to the distributed cache. Check this:
Thanks! :-). I was able to accomplish this with the [--python-archive] option in MrJob. Here's the sample command line statement I used:
python program.py -r emr s3://jvl-mids-w205-assignment4/input/tweets.csv --output-dir=s3://jvl-mids-w205-assignment4/output/job/ --no-output --conf-path mrjob.conf --python-archive utils.py
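As far as I can tell, --python-archive uploads utils.py alongside the job and puts it on the Python path of the task nodes, which is why the import resolves now. If you'd rather not pass it every time, I believe the equivalent setting is python_archives in mrjob.conf (option name as I understand the docs; worth double-checking against your mrjob version):

runners:
  emr:
    python_archives:
      - utils.py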
I was also having issues running on EMR when specifying the full path of the mapper and the reducer in the web interface configuration. However, if I only put the filenames and then add, in the Arguments box, -files s3://
By default the web interface reads from S3 and expects the path to your mapper (and reducer) on S3 in the following format: BucketName/path/Mapperprogram.py. The mapper/reducer executable needs to be part of that path. The interface then puts that path into the args once you finish configuring your steps. If you look at the command it generates under the Steps section after you provide the paths, you'll see what has been added to the command.
I sometimes run into the same problem. I checked the command it generated, and the -files option has the correct files. If I clone the cluster and run it again, it works. Very strange.
Write your issues/questions about assignment 4 here