Purpose of code changes on this branch:
This is the initial code check-in for the Hadoop implementation of Random
Indexing
Overview of code:
1. A new top-level directory for compiling optional code
2. New Hadoop-specific code
3. The core logic is contained in WordCooccurrenceCountingJob, which uses
Hadoop to extract all of the word co-occurrences from text
4. A Random Indexing implementation that uses the output of
WordCooccurrenceCountingJob to generate a SemanticSpace
5. A new utility for incrementally writing SemanticSpace instances
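For reviewers who haven't looked at the counting step yet, the windowed pair counting that the mapper side of WordCooccurrenceCountingJob performs could be sketched roughly as below. This is plain Java with the Hadoop boilerplate stripped out; the window size, tab delimiter, and method name are illustrative, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class CooccurrenceSketch {

    /**
     * Counts (word, neighbor) pairs where the neighbor falls within
     * +/- window positions of the word. In the real job, each pair
     * would be emitted as a mapper key and the counts summed in the
     * reducer; here a HashMap stands in for that shuffle/sum step.
     */
    public static Map<String, Integer> countPairs(String text, int window) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j == i)
                    continue;
                // Keyed as "word\tneighbor", an arbitrary choice here.
                String pair = tokens[i] + "\t" + tokens[j];
                counts.merge(pair, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In the Hadoop version the inner loop's `counts.merge` call corresponds to `context.write(pair, ONE)` in a Mapper, with a summing Reducer producing the final counts.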
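And for context on item 4, the Random Indexing step consumes those pair counts by folding each neighbor's sparse ternary index vector into the focus word's semantic vector. The toy sketch below shows only that update rule; the class name, dimensionality, seeding scheme, and seed count are made up for illustration and are not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomIndexingSketch {

    private final int dimensions;
    private final int seedsPerSign; // how many +1s and how many -1s per index vector
    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semanticVectors = new HashMap<>();

    public RandomIndexingSketch(int dimensions, int seedsPerSign) {
        this.dimensions = dimensions;
        this.seedsPerSign = seedsPerSign;
    }

    /**
     * Returns the sparse ternary index vector for a word, creating it
     * on first use. Seeding from the word's hash keeps the vector
     * stable across calls; assumes 2 * seedsPerSign <= dimensions.
     */
    private int[] indexVectorFor(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            Random rand = new Random(w.hashCode());
            int[] v = new int[dimensions];
            for (int i = 0; i < seedsPerSign * 2; i++) {
                int pos;
                do {
                    pos = rand.nextInt(dimensions);
                } while (v[pos] != 0);
                v[pos] = (i % 2 == 0) ? 1 : -1;
            }
            return v;
        });
    }

    /** Folds one (word, neighbor, count) triple into word's semantic vector. */
    public void addCooccurrence(String word, String neighbor, int count) {
        int[] semantic = semanticVectors.computeIfAbsent(
                word, w -> new int[dimensions]);
        int[] index = indexVectorFor(neighbor);
        for (int i = 0; i < dimensions; i++)
            semantic[i] += count * index[i];
    }

    public int[] vectorFor(String word) {
        return semanticVectors.get(word);
    }
}
```

The point of the separation is that the update only needs the aggregated pair counts, which is why it can run as a post-pass over the job's output rather than inside Hadoop.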
When reviewing my code changes, please focus on:
1. Class and interface naming. Is it clear what things do?
2. Re-usability. Does it seem easy to extend the Hadoop utilities for
implementing other word co-occurrence algorithms? I'm especially thinking of
Beagle, as that's probably the most complicated algorithm.
3. The SemanticSpaceWriter doesn't work for the Text formats right now. I
think some PrintWriter magic _might_ be possible, but if you see any obvious
way to fix it, let me know.
4. Is it easy for users to figure out how these classes work? I have Hadoop
up and running on Hydra, but maybe you can test it on some corpus on Kracken
to see if it's clear to you.
5. Double-check the tokenization logic. There's a big issue where files like
the valid token set file have to exist in HDFS, rather than on the regular
file system. I think I have the logic correct, but I would definitely
appreciate you taking a look at whether all the cases are covered, especially
if you see an alternate way of getting IteratorFactory to work.
After the review, I'll merge this branch into:
/trunk
Original issue reported on code.google.com by David.Ju...@gmail.com on 3 Sep 2010 at 8:59