paulhoule / telepath

System for mining Wikipedia Usage data to read our collective mind

No output from countDBpediaMerge #5

Closed paulhoule closed 10 years ago

paulhoule commented 10 years ago

I run this job

haruhi run job -clusterId smallAwsCluster -jarId telepath countDBpediaMerge \
 -pageIds s3n://o2dbpedia/3.9/en/page_ids_en.nt.bz2 \
 -transitiveRedirects s3n://o2dbpedia/3.9/en/redirects_transitive_en.bz2 \
 -input s3n://wikimedia-summary/3dNormalizedRaw \
 -output s3n://wikimedia-summary/resolvedAgainstDBpedia-smallCluster \
 -R 4 

and I get empty output from all reducer channels. Looking at one of the mapper channels, it looks like data is going out:

2014-02-03 21:06:55,326 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2014-02-03 21:06:55,335 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 18659089 bytes
2014-02-03 21:07:04,400 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2014-02-03 21:07:04,409 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 18751397 bytes
2014-02-03 21:07:13,376 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2014-02-03 21:07:13,385 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 18113942 bytes
2014-02-03 21:07:22,230 INFO org.apache.hadoop.mapred.Merger (main): Merging 9 sorted segments
2014-02-03 21:07:22,239 INFO org.apache.hadoop.mapred.Merger (main): Down to the last merge-pass, with 9 segments left of total size: 18626433 bytes
2014-02-03 21:07:31,115 INFO org.apache.hadoop.mapred.Task (main): Task:attempt_201402032101_0001_m_000016_0 is done. And is in the process of commiting
2014-02-03 21:07:32,348 INFO org.apache.hadoop.mapred.Task (main): Task 'attempt_201402032101_0001_m_000016_0' done.
2014-02-03 21:07:32,349 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

So the data is getting lost in the reducer, but this may well be the fault of wrong data coming from the mapper.
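
One way to narrow down where the records disappear is to compare the record counters on either side of the shuffle. The sketch below is only an assumption-laden illustration, not anything haruhi exposes directly: it presumes a handle on a completed job driven through the org.apache.hadoop.mapreduce Job API.

// Sketch only, not telepath code: compare the shuffle counters of a completed Job.
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ShuffleCounterCheck {
    // Call after job.waitForCompletion(true).
    public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        long mapOut    = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long reduceIn  = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
        long reduceOut = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
        System.out.println("map output records:    " + mapOut);
        System.out.println("reduce input records:  " + reduceIn);   // should match the map output count
        System.out.println("reduce output records: " + reduceOut);  // zero means the reducer drops everything
    }
}

If the map-output and reduce-input counts match but the reduce output is zero, the loss is inside the reducer; if they don't match, the problem is upstream of it.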

paulhoule commented 10 years ago

The logs I get back from AWS are incomplete, so it's not practical to look through them all to see if there is a problem with the pageIds or transitiveRedirects. I'm wondering if the problem is as simple as the input files being bz2 compressed and the system not being configured to handle that. A logical thing to do is to create a micro test case that is small enough that I could put a very chatty logger in it.
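
The bz2 hypothesis itself can be checked with something small before building the micro test case, for example a sketch like this (not part of telepath) that asks Hadoop which codec, if any, it would pick for a .bz2 path under the cluster's configuration:

// Sketch: print the compression codec Hadoop would choose for a .bz2 file.
// If nothing is registered, the input format would hand raw compressed bytes
// to the mapper, which would explain seeing no usable records.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class Bz2CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // picks up the cluster's core-site.xml from the classpath
        CompressionCodec codec = new CompressionCodecFactory(conf)
                .getCodec(new Path("page_ids_en.nt.bz2"));   // only the file extension matters here
        System.out.println(codec == null ? "no codec registered for .bz2"
                                         : codec.getClass().getName());
    }
}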

paulhoule commented 10 years ago

OK, I fixed multiple problems leading up to this. One was that I wasn't putting all of the files in the input path, and another was that my 'SortComparator' only sorted on the secondary keys, not on the combination of primary and secondary.
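
For reference, that second fix amounts to something like the sketch below; the key class, field types, and names are assumptions, not the actual telepath code. The point is that the sort comparator has to order on the primary key first and only then fall back to the secondary key, otherwise records that belong to one reduce group arrive interleaved.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKey implements WritableComparable<CompositeKey> {
    private final Text primary = new Text();                    // hypothetical: the resolved page id
    private final LongWritable secondary = new LongWritable();  // hypothetical: a record-type tag

    @Override public void write(DataOutput out) throws IOException {
        primary.write(out);
        secondary.write(out);
    }

    @Override public void readFields(DataInput in) throws IOException {
        primary.readFields(in);
        secondary.readFields(in);
    }

    @Override public int compareTo(CompositeKey other) {
        int cmp = primary.compareTo(other.primary);                     // primary first...
        return cmp != 0 ? cmp : secondary.compareTo(other.secondary);   // ...then secondary
    }

    // The comparator Hadoop applies when sorting map output; the original bug
    // was equivalent to comparing only `secondary` here.
    public static class SortComparator extends WritableComparator {
        public SortComparator() { super(CompositeKey.class, true); }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return ((CompositeKey) a).compareTo((CompositeKey) b);
        }
    }
}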

Now I've got a related but different problem. When I run

haruhi run job -clusterId smallAwsCluster -jarId telepath countDBpediaMerge \
 -pageIds s3n://o2dbpedia/3.9/en/page_ids_en.nt.bz2 \
 -transitiveRedirects s3n://o2dbpedia/3.9/en/redirects_transitive_en.nt.bz2 \
 -input s3n://wikimedia-summary/3dNormalizedRaw \
 -output s3n://wikimedia-summary/resolvedAgainstDBpedia \
 -R 4 

I get some output, but way too little. I count about 11,000 output lines, and it should be much more than that, something more like 4 million lines.

paulhoule commented 10 years ago

I'm a little embarrassed to admit it, but I did a test run where I filtered all of the input files with

grep Justin | grep Bieber

and in that case the joins worked. If I scan the output of the above job like

zgrep Justin *.gz

I see nothing, so at least we can use him as a test case once again.

paulhoule commented 10 years ago

I had the crazy idea that we'd configured Hadoop in such a way that we couldn't process the bzip2 form of the DBpedia files and, by golly, that seems to be the case. Once I recompressed these files and tried

haruhi run job -clusterId smallAwsCluster -jarId telepath countDBpediaMerge \
 -pageIds s3n://o2dbpedia/3.9/en/page_ids_en.nt.gz \
 -transitiveRedirects s3n://o2dbpedia/3.9/en/redirects_transitive_en.nt.gz \
 -input s3n://wikimedia-summary/3dNormalizedRaw \
 -output s3n://wikimedia-summary/resolvedAgainstDBpediaAllGzip \
 -R 4 

I get 4 files that are 40MB apiece, which is about what I'd expect.