hinneburg / TopicExplorer


common-crawl: extract all information about Japanese pages from single meta-data file #71

Closed hinneburg closed 10 years ago

fluecke commented 10 years ago

I decided not to look for Japanese pages in a single meta-data file, as I cannot be sure that there are any in the selected file. Instead I will attempt to output all metadata of that segment, hoping that the size of the output file will not exceed a few hundred megabytes. The output should be a tab-separated-values file containing a list of domain names and JSON objects with each domain's metadata.

I modified the ExampleMetadataDomainPageCount example to use a mapper that maps (Text, Text) to (Text, Text) instead of (Text, LongWritable), and replaced the LongSumReducer with the IdentityReducer.
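
A minimal sketch of that change, assuming the old org.apache.hadoop.mapred API that the commoncrawl examples use; the class name is made up, and the real mapper derives the domain name from the page URL just as the original example does:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: instead of emitting (domain, 1) as
// (Text, LongWritable), it passes the JSON metadata through as (Text, Text).
public class MetadataPassThroughMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // key: domain name, value: the JSON metadata record for that page
        output.collect(key, value);
    }
}

In the job setup, conf.setReducerClass(IdentityReducer.class) (from org.apache.hadoop.mapred.lib) then replaces the LongSumReducer, so every (domain, JSON) pair is written out unchanged.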

fluecke commented 10 years ago

At first glance it seems almost trivial to create a new EMR job flow, but there must be a trick to it, as all jobs I submit fail. The error messages state that the main class could not be found.

New jobs can be created using the elastic-mapreduce program provided by Amazon; it is also part of the commoncrawl-ami. The following command can be used to create a single-node cluster:

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
--name "te-common-crawl-test-1" \
--jar "s3n://te-common-crawl/ExampleMetadataDomainPageCount.jar" \
--step-name "Run_test-1" --log-uri "s3n://te-common-crawl/emr" \
--access-id [AWS Access ID] --private-key [AWS Private Key] \
--arg "s3n://te-common-crawl" \
--instance-group master --instance-type m1.small --instance-count 1

I tried passing the name of the main class as a parameter to elastic-mapreduce using the --main-class flag, but it does not seem to work.

Upon closer inspection, EMR uses the org.apache.hadoop.util.RunJar class to start the jar that was passed to elastic-mapreduce. This class inspects the jar's manifest for a main class; if none is given, it takes the second command-line parameter as the main class name. I expected the main class name to be set in the manifest of the jar file, but the command I used to compile does not seem to set it.
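
One way to write the Main-Class attribute into the manifest would be the jar tool's e option; this is only a sketch and assumes the compiled classes live in a classes/ directory, which may not match the actual build layout:

# 'c' creates the jar, 'f' names the output file, 'e' sets Main-Class
jar cfe ExampleMetadataDomainPageCount.jar \
    org.commoncrawl.examples.ExampleMetadataDomainPageCount \
    -C classes .

For now I updated the command to start an elastic-mapreduce job to: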

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
--name "te-common-crawl-test-1" \
--jar "s3n://te-common-crawl/ExampleMetadataDomainPageCount.jar" \
--step-name "Run_test-1" --log-uri "s3n://te-common-crawl/emr" \
--access-id [AWS Access ID] --private-key [AWS Private Key] \
--arg "org/commoncrawl/examples/ExampleMetadataDomainPageCount"  \
--arg "s3n://te-common-crawl" \
--instance-group master --instance-type m1.small --instance-count 1

The job still failed, but the error log now states that the output location already exists:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://te-common-crawl already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:938)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount.run(ExampleMetadataDomainPageCount.java:232)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount.main(ExampleMetadataDomainPageCount.java:244)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Hadoop is known to fail when the output path already exists.
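
Two simple ways around this, sketched here under the assumption of the stock hadoop command line of that release (the path is illustrative):

# either remove the stale output directory before resubmitting ...
hadoop fs -rmr "s3n://te-common-crawl/out"

# ... or generate a fresh output path for every run
outpath="out-$(date '+%Y%m%d-%H%M%S')"

The scripts in the later comments take the second approach, using mktemp to generate a unique name.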

fluecke commented 10 years ago

I set the output path to s3n://te-common-crawl/out, which did not exist before running. The job completed successfully and took about 4 minutes on an m1.small with 1.7 GB RAM. The output file has a size of 417 MB.

fluecke commented 10 years ago

I managed to have my compiler script set the main class of the jar package. This means that it is no longer necessary to pass the main class name to the Java program. I also moved from passing the credentials directly on the command line to storing them in a file and passing that to the tool that starts a new job flow. Then I wrote another shell script to simplify starting new job flows:

#!/bin/bash

instancetype="m1.small"
jarfile=""
outpath="$(basename $(mktemp -u -t out.XXXX))"
bucketname=""
credentialsfile=""

while getopts "b:c:i:j:" option; do
    case $option in
        b) bucketname="$OPTARG";;
        c) credentialsfile=$OPTARG;;
        i) instancetype=$OPTARG;;
        j) jarfile=$OPTARG;;
    esac
done

if [[ -z $bucketname ]]; then
    echo "Please supply a bucket name"
    exit 1
fi

if [[ -z $credentialsfile ]]; then
    echo "Please suppy a credentials file"
    exit 1
fi

if [[ -z $jarfile ]]; then
    echo "Please supply a jar name"
    exit 1
fi

aws put $bucketname $jarfile

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
    --name "te-common-crawl-test-1" \
    --jar "s3n://$bucketname/$(basename $jarfile)" \
    --step-name "Run_test-1" --log-uri "s3n://$bucketname/logs" \
    --credentials $credentialsfile \
    --arg "s3n://$bucketname/$outpath" \
    --instance-group master --instance-type "$instancetype" --instance-count 1

if [[ $? == 0 ]]; then
    echo "Output will be at s3n://$bucketname/$outpath"
else
    exit 1
fi

It, too, could use a bit more work and could be made a bit more flexible, but making it too complex would defeat its purpose.
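
A hypothetical invocation (the script and file names are made up for illustration):

./start-emr-job.sh -b te-common-crawl -c credentials.json \
    -i m1.small -j ExampleMetadataDomainPageCount.jar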

fluecke commented 10 years ago

I modified the script that starts an EMR job to make it possible to start a cluster with more than one node. I also made the step name more meaningful.

#!/bin/bash

# Parse a count:group:type[:bid] specification and append the corresponding
# instance flags to the global $instanceString.
function addInstance() {
    IFS=':'
    set -- $1
    unset IFS

    local icount=$1
    local igroup=$2
    local itype=$3
    local ibid=$4

    if [[ -z $ibid ]]; then
        # no bid price given: request on-demand instances
        instanceString="$instanceString --instance-group $igroup --instance-type $itype --instance-count $icount"
    else
        # bid price given: request spot instances
        instanceString="$instanceString --instance-group $igroup --instance-type $itype --instance-count $icount --bid-price $ibid"
    fi
}

jarfile=""
instanceString=""
outpath="$(basename $(mktemp -u -t out.XXXX))"
bucketname=""
credentialsfile=""

while getopts "b:c:i:j:" option; do
    case $option in
        b) bucketname="$OPTARG";;
        c) credentialsfile=$OPTARG;;
        i) addInstance $OPTARG;;
        j) jarfile=$OPTARG;;
    esac
done

if [[ -z $instanceString ]]; then
    echo "Please supply at least one instance"
    exit 1
fi

if [[ -z $bucketname ]]; then
    echo "Please supply a bucket name"
    exit 1
fi

if [[ -z $credentialsfile ]]; then
    echo "Please suppy a credentials file"
    exit 1
fi

if [[ -z $jarfile ]]; then
    echo "Please supply a jar name"
    exit 1
fi

stepname="run_$(basename jarfile)_$(date '+%d-%m-%Y')"

aws put $bucketname $jarfile

elastic-mapreduce --create --ami-version="2.1.1" --hadoop-version="0.20.205" \
    --name "te-common-crawl-test-1" \
    --jar "s3n://$bucketname/$(basename $jarfile)" \
    --step-name $stepname --log-uri "s3n://$bucketname/logs" \
    --credentials $credentialsfile \
    --arg "s3n://$bucketname/$outpath" \
    $instanceString

if [[ $? == 0 ]]; then
    echo "Output will be at s3n://$bucketname/$outpath"
else
    exit 1
fi

To add an instance group, pass the parameter -i followed by a string with the following pattern:

<count>:<group>:<type>[:<bid>]

<count> := [0-9]+
<group> := master | core | task
<type>  := one of the EC2 instance types available from AWS
<bid>   := [0-9]+.[0-9]+
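
For example, a hypothetical invocation requesting one on-demand master instance and four core spot instances with a bid of 0.08 (script and file names are again illustrative):

./start-emr-job.sh -b te-common-crawl -c credentials.json \
    -j ExampleMetadataDomainPageCount.jar \
    -i 1:master:m1.small -i 4:core:m1.small:0.08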