hinneburg / TopicExplorer

GNU Affero General Public License v3.0

How to read metadata files #61

Closed hinneburg closed 10 years ago

hinneburg commented 10 years ago
fluecke commented 10 years ago

Compile the code:

$ ls
ExampleMetadataDomainPageCount.java  hadoop_dependencies

$ ls hadoop_dependencies/
gson-2.2.1.jar  guava-12.0.jar  httpcore-4.2.1.jar  jsoup-1.6.3.jar

$ mkdir Metadata

$ javac -cp $(hadoop classpath):/home/ec2-user/hadoop_dependencies/gson-2.2.1.jar:/home/ec2-user/hadoop_dependencies/guava-12.0.jar -d Metadata/ ExampleMetadataDomainPageCount.java

$ jar cvf ExampleMetadataDomainPageCount.jar -C Metadata/ .
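For reference, the long `-cp` argument above is nothing more than the output of `hadoop classpath` with the extra dependency jars appended, joined by `:` separators. A minimal sketch of that joining step (the base path is a stand-in for the real `$(hadoop classpath)` output):

```shell
# Join extra dependency jars onto a base classpath with ':' separators.
base_cp="/usr/lib/hadoop/conf"   # stand-in for $(hadoop classpath)
cp_arg="$base_cp"
for jar in gson-2.2.1.jar guava-12.0.jar; do
    cp_arg="$cp_arg:$jar"
done
echo "$cp_arg"
```

This prints `/usr/lib/hadoop/conf:gson-2.2.1.jar:guava-12.0.jar`, which is the shape `javac -cp` expects.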
fluecke commented 10 years ago

Run the code:

$ hadoop jar ExampleMetadataDomainPageCount.jar org.commoncrawl.examples.ExampleMetadataDomainPageCount ~/out
fluecke commented 10 years ago

Tasks fail with a java.io.IOException:

14/06/05 09:49:57 INFO mapred.JobClient: Task Id : attempt_201406050919_0004_m_000001_2, Status : FAILED
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "/bin/ls": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:488)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
    at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:703)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:418)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:401)
    at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:251)
    at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
    at java.lang.ProcessImpl.start(ProcessImpl.java:81)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:470)
    ... 15 more

    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:401)
    at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:251)
    at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

This happens until enough tasks have failed for the whole job to be aborted:

14/06/05 09:50:12 INFO mapred.JobClient: Job complete: job_201406050919_0004
14/06/05 09:50:12 INFO mapred.JobClient: Counters: 7
14/06/05 09:50:12 INFO mapred.JobClient:   Job Counters 
14/06/05 09:50:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=112652
14/06/05 09:50:12 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 09:50:12 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 09:50:12 INFO mapred.JobClient:     Rack-local map tasks=8
14/06/05 09:50:12 INFO mapred.JobClient:     Launched map tasks=8
14/06/05 09:50:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 09:50:12 INFO mapred.JobClient:     Failed map tasks=1
14/06/05 09:50:12 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201406050919_0004_m_000000
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1257)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount.run(ExampleMetadataDomainPageCount.java:232)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.commoncrawl.examples.ExampleMetadataDomainPageCount.main(ExampleMetadataDomainPageCount.java:244)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:622)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

It seems the t1.micro might not be sufficient after all. This is due to how Java spawns new processes: running an external command like `/bin/ls` forks the JVM, which momentarily requires committing as much memory again as the parent process holds, see here.
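A common workaround for `error=12, Cannot allocate memory` on small instances is to give the kernel room to satisfy `fork()`. A sketch, to be run as root; the 1 GB swap size is an assumption, adjust to the instance:

```shell
# Add a swap file so fork() can be backed by swap instead of failing.
dd if=/dev/zero of=/swapfile bs=1M count=1024
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Alternatively, let the kernel overcommit so fork() does not need to
# reserve the full parent address space up front:
sysctl vm.overcommit_memory=1
```

Either way this only papers over the underlying shortage; a larger instance type is the more robust fix.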

fluecke commented 10 years ago

I have written a small bash script to make compiling Hadoop classes easier. It is not well documented and could use some work in the future, but it works for now. The first argument to the script is the `.java` source file, followed by any number of paths to additional dependencies.

#!/bin/bash

function compile_class() {
    local class_directory="$1"
    local class="$2"

    shift 2

    # Remaining arguments are dependency jars; join them with ':' separators.
    local additional_dependencies
    additional_dependencies="$(for dep in "$@"; do printf ':%s' "$dep"; done)"
    local hadoop_classpath
    hadoop_classpath="$(hadoop classpath)"

    echo -n "Compiling class..."
    javac -cp "$hadoop_classpath$additional_dependencies" \
        -d "$class_directory" "$class"
    echo " done."
}

function make_jar() {
    local class_directory="$1"
    local jar_name="$2"

    echo -n "Creating jar... "
    jar cvf "$jar_name" -C "$class_directory" . >/dev/null
    echo "done."
}

function main() {
    local class="$1"
    # Derive the jar name from the source file: Foo.java -> Foo.jar
    local jar_name
    jar_name="$(basename "$class")"
    jar_name="${jar_name%.java}.jar"
    local class_directory
    class_directory="$(mktemp -d)"

    shift

    compile_class "$class_directory" "$class" "$@"
    make_jar "$class_directory" "$jar_name"

    echo "All done. Look for \"${jar_name}\" in your current directory."

    rm -rf "$class_directory"
}

# Pass the arguments through properly quoted so paths with spaces survive.
main "$@"
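Assuming the script is saved as `compile.sh` (the filename is an assumption), an invocation matching the compile steps earlier in this thread would look like:

```shell
$ chmod +x compile.sh
$ ./compile.sh ExampleMetadataDomainPageCount.java \
      hadoop_dependencies/gson-2.2.1.jar \
      hadoop_dependencies/guava-12.0.jar
```

This should leave `ExampleMetadataDomainPageCount.jar` in the current directory, ready for `hadoop jar`.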