Hadoop code review request

Purpose of code changes on this branch:

This is the initial code check-in for the Hadoop implementation of Random 
Indexing.

Overview of code:

1. A new top-level directory for compiling optional code

2. New Hadoop-specific code

3. The core logic is contained in WordCooccurrenceCountingJob, which uses 
Hadoop to extract all the co-occurrences from text

4. A Random Indexing implementation that uses the output of 
WordCooccurrenceCountingJob to generate a SemanticSpace

5. A new utility for incrementally writing SemanticSpace instances
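
For reviewers who want a concrete picture of what "extracting co-occurrences" means in item 3, here is a minimal sketch in plain Java. It is not the WordCooccurrenceCountingJob code on this branch and it deliberately avoids the Hadoop API. The class name CooccurrenceSketch, the whitespace tokenization, the symmetric window, and the tab-separated key format are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: NOT the WordCooccurrenceCountingJob from this branch.
// All names and formats here are hypothetical.
final class CooccurrenceSketch {

    // "Map" step: for each token, pair it with every neighbor that lies
    // within `window` positions on either side. The "reduce" step (summing
    // the 1s emitted per key) is folded in through Map.merge.
    static void map(String line, int window, Map<String, Integer> counts) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // nothing to count on a blank line
        }
        String[] tokens = trimmed.split("\\s+"); // assumed tokenization
        for (int i = 0; i < tokens.length; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j == i) {
                    continue; // a token does not co-occur with itself
                }
                // Key format "focus<TAB>neighbor" is an assumption.
                counts.merge(tokens[i] + "\t" + tokens[j], 1, Integer::sum);
            }
        }
    }

    // Runs the map step over every line and returns the summed counts.
    static Map<String, Integer> count(Iterable<String> lines, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            map(line, window, counts);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            count(Arrays.asList("the cat sat", "the cat ran"), 1);
        counts.forEach((pair, n) -> System.out.println(pair + "\t" + n));
    }
}
```

In a real Hadoop job the map step would emit each pair with a count of 1 and a separate reducer would do the summing. The sketch merges the two steps only to stay self-contained.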

When reviewing my code changes, please focus on:

1. Class and interface naming.  Is it clear what things do?

2. Re-usability.  Does it seem easy to extend the Hadoop utilities for 
implementing other word co-occurrence algorithms?  I'm especially thinking of 
Beagle, as that's probably the most complicated algorithm.

3. The SemanticSpaceWriter doesn't work for the Text formats right now.  I 
think some PrintWriter magic _might_ be possible, but if you see any obvious 
way to fix it, let me know.

4. Is it easy for users to figure out how these classes work?  I have Hadoop up 
and running on Hydra, but maybe you can test it on some corpus on Kracken to 
see if it's clear to you.

5. Double check the tokenization logic.  There's a big issue where files such 
as the valid token set file have to exist in HDFS rather than the regular 
file system.  I think I have the logic correct, but I would definitely 
appreciate you checking whether all the cases are covered, especially if you 
see an alternate way of getting IteratorFactory to work.
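
To make the cases in item 5 easier to enumerate, here is a small sketch of how a resource path such as the token set file might be classified by file system. It is plain Java and is not the logic on this branch. The class ResourceLocationSketch, the Location enum, and the defaultScheme parameter are hypothetical. The sketch assumes that a path with no scheme falls back to whatever default file system the job is configured with, which is how Hadoop resolves unqualified paths.

```java
import java.net.URI;

// Illustrative only: NOT the tokenization code from this branch.
// All names here are hypothetical.
final class ResourceLocationSketch {

    enum Location { HDFS, LOCAL, OTHER }

    // Classifies a resource path by its URI scheme. When the path has no
    // scheme, `defaultScheme` stands in for the job's default file system
    // (an assumption about how the caller is configured).
    static Location locate(String path, String defaultScheme) {
        String scheme = URI.create(path).getScheme();
        if (scheme == null) {
            scheme = defaultScheme; // unqualified path: use the default
        }
        if ("hdfs".equalsIgnoreCase(scheme)) {
            return Location.HDFS;
        }
        if ("file".equalsIgnoreCase(scheme)) {
            return Location.LOCAL;
        }
        return Location.OTHER; // some other scheme, not covered here
    }

    public static void main(String[] args) {
        // The same unqualified path lands in different places depending
        // on the configured default, which is the case worth checking.
        System.out.println(locate("/data/tokens.txt", "hdfs"));
        System.out.println(locate("/data/tokens.txt", "file"));
    }
}
```

The case that seems most likely to be missed is the unqualified path: it refers to HDFS on a cluster but to the local disk when the same code runs in local mode.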

After the review, I'll merge this branch into:
/trunk

Original issue reported on code.google.com by David.Ju...@gmail.com on 3 Sep 2010 at 8:59