Closed: paulhoule closed this issue 10 years ago
The logs I get back from AWS are incomplete, so it's not practical to look through them all to see whether there is a problem with the pageIds or transitiveRedirects. I'm wondering if the problem is as simple as the input files being bz2-compressed and the system not being configured to handle that. A logical next step is to create a micro test case small enough that I can attach a very chatty logger.
Ok, I fixed multiple problems leading up to this. One was that I wasn't putting all of the files in the input path, and another was that my 'SortComparator' only sorted the secondary keys, not the combination of primary and secondary keys.
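For reference, the comparator fix amounts to comparing the primary key first and only falling back to the secondary key on ties. A minimal sketch in plain Java (not the actual Hadoop `WritableComparator` subclass; the key layout here is hypothetical):

```java
import java.util.Comparator;

// Hypothetical composite key with a primary and a secondary component.
class CompositeKey {
    final long primary;
    final long secondary;

    CompositeKey(long primary, long secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }
}

// The broken comparator ordered by secondary key alone; the fix is to
// order by primary key first, using the secondary key as a tie-breaker.
class CompositeKeyComparator implements Comparator<CompositeKey> {
    @Override
    public int compare(CompositeKey a, CompositeKey b) {
        int byPrimary = Long.compare(a.primary, b.primary);
        if (byPrimary != 0) {
            return byPrimary;
        }
        return Long.compare(a.secondary, b.secondary);
    }
}
```

In Hadoop itself this logic would live in a `WritableComparator` registered via `Job.setSortComparatorClass`, comparing the serialized key bytes, but the ordering rule is the same.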
Now I've got a new but related problem. When I run
haruhi run job -clusterId smallAwsCluster -jarId telepath countDBpediaMerge \
-pageIds s3n://o2dbpedia/3.9/en/page_ids_en.nt.bz2 \
-transitiveRedirects s3n://o2dbpedia/3.9/en/redirects_transitive_en.nt.bz2 \
-input s3n://wikimedia-summary/3dNormalizedRaw \
-output s3n://wikimedia-summary/resolvedAgainstDBpedia \
-R 4
I get some output, but way too little: I count about 11,000 output lines, when it should be closer to 4 million.
I'm a little embarrassed to admit it, but I did a test run where I filtered all of the input files with
grep Justin | grep Bieber
and in that case the joins worked. If I scan the output of the above job like
zgrep Justin *.gz
I see nothing, so at least we can use him as a test case once again.
I had the crazy idea that we'd configured Hadoop in such a way that we couldn't process the bzip2 form of the DBpedia files and, by golly, that seems to be the case. Once I recompressed these files as gzip and ran
haruhi run job -clusterId smallAwsCluster -jarId telepath countDBpediaMerge \
-pageIds s3n://o2dbpedia/3.9/en/page_ids_en.nt.gz \
-transitiveRedirects s3n://o2dbpedia/3.9/en/redirects_transitive_en.nt.gz \
-input s3n://wikimedia-summary/3dNormalizedRaw \
-output s3n://wikimedia-summary/resolvedAgainstDBpediaAllGzip \
-R 4
I get 4 files of about 40 MB apiece, which is roughly what I'd expect.
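For the record, bzip2 input support in Hadoop is normally controlled by the configured codec list in core-site.xml. A fragment like the following would enable it (this is a sketch of a standard configuration, not our cluster's actual config):

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```

If `BZip2Codec` is missing from that list, Hadoop won't decompress `.bz2` inputs, which would explain the behavior above.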
When I run this job, I get empty output from all of the reducer channels. Looking at one of the mapper channels, it looks like records are being emitted, so the data is getting lost in the reducer; but this may well be the fault of wrong data coming out of the mapper.