Implement Word Sense Induction (WSI) based on Chinese Whispers (CW) in Spark

Motivation

Currently, one important component of the JoBimText pipeline, the word sense induction, runs in a non-distributed fashion. This requires transferring files out of HDFS and back, and it limits the scalability of the method. Your goal is to implement this component in a distributed way.

Implementation

Download the similarity graph: http://cental.fltr.ucl.ac.be/team/~panchenko/data/serelex/norm60.tgz
Replace every ";" with "\t" in the file (the column separator).
Run the current WSI method locally: https://github.com/tudarmstadt-lt/chinese-whispers. Use the script https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/run.sh
Read the paper to understand how the algorithm works: http://www.aclweb.org/website/old_anthology/W/W06/W06-38.pdf#page=83
Implement the algorithm with the Spark GraphX library. For reference see:
Write unit tests that verify that your implementation produces the same output as the original one.
Write a report comparing the memory consumption, computation time and occupied disk space of the two implementations of the WSI system (the original one and the Spark one).
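
To make the steps above more concrete, here is a minimal single-machine sketch of Chinese Whispers label propagation in Python. It is not the Spark GraphX implementation and not the code of the existing tool; it is only a small reference that may help when reasoning about expected results on toy graphs. Assumptions: the input has three columns (word, related word, similarity weight), the graph is treated as undirected, ties between equally strong labels are broken alphabetically, and the names read_edges and chinese_whispers are hypothetical helpers introduced only for this illustration.

```python
import random
from collections import defaultdict


def read_edges(lines, sep="\t"):
    """Parse 'word<sep>neighbour<sep>weight' lines into an undirected graph.

    The three-column layout is an assumption about the input file.
    """
    graph = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        src, dst, weight = line.split(sep)[:3]
        w = float(weight)
        graph[src][dst] = w
        graph[dst][src] = w
    return graph


def chinese_whispers(graph, iterations=20, seed=0):
    """Return a dict mapping each node to its cluster label."""
    rng = random.Random(seed)
    # Every node starts in its own class.
    labels = {node: node for node in graph}
    nodes = sorted(graph)
    for _ in range(iterations):
        # Nodes are visited in a random order in every iteration.
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            # Sum the edge weights per label among the neighbours.
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            if not scores:
                continue
            # Strongest label wins; ties go to the alphabetically first
            # label (an assumption made here for reproducibility).
            best = max(sorted(scores), key=scores.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # no label changed in a full pass
    return labels


# Toy input in the original ";"-separated form, converted to tabs first.
raw = ["a;b;1.0", "b;c;1.0", "a;c;1.0",
       "x;y;1.0", "y;z;1.0", "x;z;1.0",
       "c;x;0.1"]
toy_graph = read_edges(line.replace(";", "\t") for line in raw)
toy_labels = chinese_whispers(toy_graph, seed=42)
```

Because the node order is random, two runs with different seeds can yield different labels on less clear-cut graphs, which is worth keeping in mind when comparing the outputs of two implementations in unit tests.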
