Closed by hinneburg 10 years ago
At first glance it seems almost trivial to create a new EMR job flow, but there seems to be a trick to it as all jobs I submit fail. The error messages mention that the main class could not be found.
New job flows can be created using the elastic-mapreduce tool provided by Amazon, which is also part of the commoncrawl-ami. The following command creates a single-node cluster:
elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
--name "te-common-crawl-test-1" \
--jar "s3n://te-common-crawl/ExampleMetadataDomainPageCount.jar" \
--step-name "Run_test-1" --log-uri "s3n://te-common-crawl/emr" \
--access-id [AWS Access ID] --private-key [AWS Private Key] \
--arg "s3n://te-common-crawl" \
--instance-group master --instance-type m1.small --instance-count 1
I tried passing the name of the main class as a parameter to elastic-mapreduce using the --main-class flag, but it does not seem to work.
Upon closer inspection, EMR uses the org.apache.hadoop.util.RunJar class to start the jar that was passed to elastic-mapreduce. This class inspects the manifest file of the jar for its main class; if none is given, it takes the second command line parameter as the main class name.
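For reference, a jar whose manifest declares its entry point carries a line like the following in META-INF/MANIFEST.MF (class name taken from the example jar used here):

```
Main-Class: org.commoncrawl.examples.ExampleMetadataDomainPageCount
```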
I expected the main class name to be set in the manifest of the jar file, but the command I used to compile does not seem to set it. For now, I updated the command that starts an elastic-mapreduce job to:
elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
--name "te-common-crawl-test-1" \
--jar "s3n://te-common-crawl/ExampleMetadataDomainPageCount.jar" \
--step-name "Run_test-1" --log-uri "s3n://te-common-crawl/emr" \
--access-id [AWS Access ID] --private-key [AWS Private Key] \
--arg "org/commoncrawl/examples/ExampleMetadataDomainPageCount" \
--arg "s3n://te-common-crawl" \
--instance-group master --instance-type m1.small --instance-count 1
The job still failed, but the error log now states that the output location already exists:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://te-common-crawl already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:938)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
at org.commoncrawl.examples.ExampleMetadataDomainPageCount.run(ExampleMetadataDomainPageCount.java:232)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.commoncrawl.examples.ExampleMetadataDomainPageCount.main(ExampleMetadataDomainPageCount.java:244)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Hadoop is known to fail when the output path already exists. I set the output path to s3n://te-common-crawl/out, which did not exist before running. The job completed successfully and took about four minutes on an m1.small with 1.7 GB of RAM. The output file is 417 MB.
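To avoid hitting the same error on repeated runs, a fresh output path can be generated for every run. A minimal sketch using mktemp (the scripts below use the same trick):

```shell
# Ask mktemp for a unique name without creating anything (-u),
# then keep only the final path component as the output directory name.
outpath="$(basename "$(mktemp -u -t out.XXXX)")"
echo "s3n://te-common-crawl/$outpath"
```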
I managed to have my compiler script set the main class of the jar package. This means that it is no longer necessary to pass the main class name to the Java program. I also moved from passing the credentials directly to storing them in a file and passing that to the tool that starts a new job flow. Then I wrote another shell script to simplify starting new job flows:
#!/bin/bash

instancetype="m1.small"
jarfile=""
outpath="$(basename "$(mktemp -u -t out.XXXX)")"
bucketname=""
credentialsfile=""

while getopts "b:c:i:j:" option; do
    case $option in
        b) bucketname=$OPTARG;;
        c) credentialsfile=$OPTARG;;
        i) instancetype=$OPTARG;;
        j) jarfile=$OPTARG;;
    esac
done

if [[ -z $bucketname ]]; then
    echo "Please supply a bucket name"
    exit 1
fi
if [[ -z $credentialsfile ]]; then
    echo "Please supply a credentials file"
    exit 1
fi
if [[ -z $jarfile ]]; then
    echo "Please supply a jar name"
    exit 1
fi

aws put "$bucketname" "$jarfile"

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
    --name "te-common-crawl-test-1" \
    --jar "s3n://$bucketname/$(basename "$jarfile")" \
    --step-name "Run_test-1" --log-uri "s3n://$bucketname/logs" \
    --credentials "$credentialsfile" \
    --arg "s3n://$bucketname/$outpath" \
    --instance-group master --instance-type "$instancetype" --instance-count 1

if [[ $? -eq 0 ]]; then
    echo "Output will be at s3n://$bucketname/$outpath"
else
    exit 1
fi
It, too, could use a bit more work and could be made a bit more flexible, but making it too complex would defeat its purpose.
I modified the script that starts an EMR job so that it can start a cluster with more than one node. I also made the step name a bit less meaningless.
#!/bin/bash

# Parse an instance specification of the form count:group:type[:bid price]
# and append the matching elastic-mapreduce flags to instanceString.
function addInstance() {
    IFS=':'
    set -- $1
    unset IFS
    local icount=$1
    local igroup=$2
    local itype=$3
    local ibid=$4
    if [[ -z $ibid ]]; then
        instanceString="$instanceString --instance-group $igroup --instance-type $itype --instance-count $icount"
    else
        instanceString="$instanceString --instance-group $igroup --instance-type $itype --instance-count $icount --bid-price $ibid"
    fi
}

jarfile=""
instanceString=""
outpath="$(basename "$(mktemp -u -t out.XXXX)")"
bucketname=""
credentialsfile=""

while getopts "b:c:i:j:" option; do
    case $option in
        b) bucketname=$OPTARG;;
        c) credentialsfile=$OPTARG;;
        i) addInstance "$OPTARG";;
        j) jarfile=$OPTARG;;
    esac
done

if [[ -z $instanceString ]]; then
    echo "Please supply at least one instance"
    exit 1
fi
if [[ -z $bucketname ]]; then
    echo "Please supply a bucket name"
    exit 1
fi
if [[ -z $credentialsfile ]]; then
    echo "Please supply a credentials file"
    exit 1
fi
if [[ -z $jarfile ]]; then
    echo "Please supply a jar name"
    exit 1
fi

stepname="run_$(basename "$jarfile")_$(date '+%d-%m-%Y')"

aws put "$bucketname" "$jarfile"

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
    --name "te-common-crawl-test-1" \
    --jar "s3n://$bucketname/$(basename "$jarfile")" \
    --step-name "$stepname" --log-uri "s3n://$bucketname/logs" \
    --credentials "$credentialsfile" \
    --arg "s3n://$bucketname/$outpath" \
    $instanceString

if [[ $? -eq 0 ]]; then
    echo "Output will be at s3n://$bucketname/$outpath"
else
    exit 1
fi
To add an instance group, pass the parameter -i followed by a string with the pattern count:group:type[:bid price], for example -i "1:master:m1.small" or -i "4:core:m1.small:0.08"; the bid price is only needed for spot instances.
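The parsing inside addInstance can be illustrated standalone. This sketch splits a hypothetical specification string on colons the same way, using IFS and set:

```shell
# Split a count:group:type[:bid price] specification on colons.
spec="4:core:m1.small:0.08"
IFS=':'
set -- $spec        # word splitting on ':' fills $1..$4
unset IFS           # restore default word splitting
icount=$1; igroup=$2; itype=$3; ibid=$4
echo "$icount $igroup $itype $ibid"
```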
I decided not to look for Japanese pages in a single metadata file, as I cannot be sure that there are any in the selected file. Instead I will attempt to output all metadata of that segment, hoping that the size of the output file will not exceed a few hundred megabytes. The output should be a tab-separated values (TSV) file containing a list of domain names and JSON objects containing each domain's metadata.
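A single line of such an output file would look like the following (the domain and JSON fields here are made up for illustration); the domain column can then be pulled out with cut:

```shell
# Write one hypothetical line: domain, a tab, then the JSON metadata object.
printf 'example.com\t{"pages": 42}\n' > sample.tsv
# Keep only the first tab-separated column (the domain).
cut -f1 sample.tsv
```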
I modified the ExampleMetadataDomainPageCount example to use a mapper that maps (Text, Text) to (Text, Text) instead of (Text, LongWritable), and replaced the LongSumReducer with the IdentityReducer.