neuralinfo / Assignments


Assignment 4 Issues/Questions #23

Open · neuralinfo opened this issue 9 years ago

neuralinfo commented 9 years ago

Write your issues/questions about assignment 4 here

howardwen commented 9 years ago

I'm getting an error when running EMR. I've tried this over 10 times and it keeps failing.

Is there something I'm doing wrong?

The stderr logfile shows:

Exception in thread "main" java.lang.IllegalArgumentException: The bucketName parameter must be specified.
    at com.amazonaws.services.s3.AmazonS3Client.rejectNull(AmazonS3Client.java:2917)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1074)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1042)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:105)
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:86)
    ......

These are the settings:

JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar
Main class: None
Arguments: -files s3://hw.w205.assignment4.1/code/wMapOccur.py,s3://hw.w205.assignment4.1/code/wReducerOccur.py -mapper wMapOccur.py -reducer wReducerOccur.py -input s3://hw.w205.assignment4.1/input/WC2015.csv -output s3://hw.w205.assignment4.1/output/job0001
Action on failure: Terminate cluster
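
For reference (this is not the poster's actual code), a Hadoop streaming mapper/reducer pair is just two scripts that read lines from stdin and write tab-separated key/value pairs to stdout. A minimal word-occurrence sketch in that spirit, standing in for wMapOccur.py and wReducerOccur.py:

#!/usr/bin/env python
# Mapper sketch (hypothetical stand-in for wMapOccur.py):
# read input lines from stdin and emit "word<TAB>1" for every token.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# Reducer sketch (hypothetical stand-in for wReducerOccur.py):
# streaming delivers keys sorted, so accumulate a count per word and
# flush it whenever the key changes.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, count))
        current_word, count = word, int(value)
if current_word is not None:
    print('%s\t%d' % (current_word, count))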

neuralinfo commented 9 years ago

Does job0001 exist in your output? If yes, you need to delete it before launching your cluster, or change that parameter to job0002 in your setup.
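
If the prefix does turn out to exist, it can be removed from the S3 console, or with a short boto3 sketch like this one (assuming boto3 is installed and credentials are configured; the bucket and prefix names below are the ones from this thread, so substitute your own):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('hw.w205.assignment4.1')          # bucket name from this thread
# Hadoop refuses to write into an existing output directory, so remove
# every object under the stale output prefix before relaunching the step.
bucket.objects.filter(Prefix='output/job0001/').delete()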

howardwen commented 9 years ago

job0001 does not exist in the output folder. I tried to change the parameter to job0002 but that doesn't work either. Can't seem to figure out the issue.

neuralinfo commented 9 years ago

Are you on Windows? If yes, make sure you do not have any non-ASCII characters in the bucket names, and check whether other users have permission to access your bucket. If you are on a Mac, have you checked the permissions on your bucket? It needs to have read/write permission.
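
A quick way to sanity-check both points from your own machine is a boto3 sketch along these lines (an assumption on my part, not part of the assignment; swap in your own bucket name):

import boto3

bucket_name = 'hw.w205.assignment4.1'     # example name from this thread

# Bucket names used with EMR should be plain ASCII; this raises if not.
bucket_name.encode('ascii')

s3 = boto3.client('s3')
s3.head_bucket(Bucket=bucket_name)        # raises ClientError if missing or forbidden
for grant in s3.get_bucket_acl(Bucket=bucket_name)['Grants']:
    print(grant)                          # shows who has read/write on the bucket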

JaimeVL commented 9 years ago

I'm using MrJob, and so far I've been able to successfully run all the EMR jobs when I use a single Python script. However, to keep my assignment code more manageable, I decided to split out some common code I use across my MR programs into an additional utils.py file. I just want to be able to import it from my main Python script (let's call it program.py) by calling this:

from utils import *

I have added the [--file] option and specified the utils.py file in the command line when I run MrJob as shown below:

python program.py -r emr s3://jvl-mids-w205-assignment4/input/tweets.csv --output-dir=s3://jvl-mids-w205-assignment4/output/job/ --no-output --conf-path mrjob.conf --file utils.py

Unfortunately, I get the following error in EMR:

Traceback (most recent call last):
  File "program1.py", line 6, in <module>
    from utils import *
ImportError: No module named utils

The interesting thing is that both program.py and utils.py are in the same folder in S3, so I'm not sure why the utils.py file is not recognized. Any idea what I'm missing? I found something online related to this, but I don't really understand the answer, so any help is appreciated:

http://stackoverflow.com/questions/26812124/hadoop-streaming-with-multiple-python-files
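
For context, the layout being described is roughly the sketch below: a minimal MRJob skeleton plus a helper module, with placeholder word-count logic rather than the actual assignment code.

# utils.py -- hypothetical shared helper imported by the job script
def tokenize(line):
    return line.strip().split()

# program.py -- minimal MRJob skeleton; the import below is what fails on
# EMR when utils.py is not shipped alongside the job script.
from mrjob.job import MRJob
from utils import *

class WordOccurrences(MRJob):
    def mapper(self, _, line):
        for word in tokenize(line):
            yield word, 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordOccurrences.run()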

saffrydaffry commented 9 years ago

Does mrjob read from S3, though? I think for --file you still might need an S3 address; otherwise it might think utils.py is within the cc-job directory.

neuralinfo commented 9 years ago

@JaimeVL: You can add the files to the distributed cache. Check this:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

JaimeVL commented 9 years ago

Thanks! :-). I was able to accomplish this with the [--python-archive] option in MrJob. Here's the sample command line statement I used:

python program.py -r emr s3://jvl-mids-w205-assignment4/input/tweets.csv --output-dir=s3://jvl-mids-w205-assignment4/output/job/ --no-output --conf-path mrjob.conf --python-archive utils.py
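
For what it's worth, mrjob's documentation describes Python archives as tarballs that get unpacked and added to the Python path on the task nodes, so if passing a bare .py file ever misbehaves, the helper can be wrapped first with a few lines like these (a sketch, assuming the archive is then passed as --python-archive utils.tar.gz):

import tarfile

# Bundle utils.py into a gzipped tarball for mrjob's --python-archive option.
with tarfile.open('utils.tar.gz', 'w:gz') as tar:
    tar.add('utils.py')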

hdanish commented 9 years ago

I was also having issues running on EMR when specifying the full path of the mapper and the reducer in the configuration in the web interface. However, if I only put the filenames and then add -files s3:///,s3:/// in the arguments box, it works. Why does that setup work, while leaving the arguments blank and specifying the full paths instead of only the filenames gives a file-not-found error?

neuralinfo commented 9 years ago

By default, the web interface has to read from S3, and it gets the path to your mapper (and reducer) on S3 with the following convention: BucketName/path/Mapperprogram.py. It needs the mapper/reducer executable to be part of the path. It then puts that path in the args once you finish configuring your steps. If you look at the command it generates under the Steps section after you have provided the paths, you'll notice what has been added to the command.

kchoi01 commented 9 years ago

I sometimes run into the same problem. I checked the command it generated, and the -files option has the correct files. If I clone the cluster and run it again, then it works. Very strange.