uhh-lt / josimtext

A system for word sense induction and disambiguation based on the JoBimText approach
http://jobimtext.org/wsd

Implement Word Sense Induction (WSI) based on Chinese Whispers (CW) in Spark #2

Closed alexanderpanchenko closed 8 years ago

alexanderpanchenko commented 9 years ago

Motivation

Currently, one important component of the JoBimText pipeline, the word sense induction, runs in a non-distributed fashion. This requires transferring files out of HDFS and back, and it limits the scalability of the method. Your goal is to implement this component in a distributed way.

Implementation

  1. Download the similarity graph: http://cental.fltr.ucl.ac.be/team/~panchenko/data/serelex/norm60.tgz
  2. Replace every ";" in the file with "\t" (the separator).
  3. Run the current WSI method locally: https://github.com/tudarmstadt-lt/chinese-whispers. Use the run.sh script: https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/run.sh
  4. Read the paper to understand how the algorithm works: http://www.aclweb.org/website/old_anthology/W/W06/W06-38.pdf#page=83
  5. Implement the algorithm with the Spark GraphX library. For reference see:
  6. Write unit tests that make sure that the output of your implementation is the same as that of the original implementation.
  7. Write a report measuring memory consumption, computation time and occupied disk space for the two implementations of the WSI system.
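
To make steps 2, 4 and 6 more concrete, here is a minimal single-machine sketch in Python. It is not the Spark GraphX implementation asked for in step 5 and not the code from the chinese-whispers repository. It only illustrates the label-propagation idea from the paper on a small graph that fits in memory. The function names (`convert_separator`, `read_edges`, `chinese_whispers`), the assumed three-column layout `word_a<TAB>word_b<TAB>weight`, the iteration limit and the tie-breaking rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def convert_separator(line):
    """Step 2: swap the ';' field separator for a tab."""
    return line.replace(";", "\t")


def read_edges(lines):
    """Parse 'word_a<TAB>word_b<TAB>weight' lines (an assumed column layout)."""
    graph = defaultdict(dict)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        a, b, weight = parts[0], parts[1], float(parts[2])
        graph[a][b] = weight
        graph[b][a] = weight  # treat the similarity graph as undirected
    return graph


def chinese_whispers(graph, iterations=20, seed=0):
    """Return a node -> class label mapping.

    Every node starts in its own class. In each pass the nodes are visited
    in random order and each node adopts the class with the largest summed
    edge weight among its neighbours.
    """
    rng = random.Random(seed)
    labels = {node: node for node in graph}
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            if not scores:
                continue  # isolated node keeps its own class
            # Assumption: ties go to the alphabetically smallest label
            # so that runs are reproducible.
            best = max(sorted(scores), key=lambda label: scores[label])
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # no label changed in this pass
    return labels
```

A small hand-built graph with a known clustering, run through both this sketch and the distributed version, could serve as a first unit test for step 6. Because the visiting order is random, comparing the resulting groups of nodes is likely more robust than comparing the label names themselves.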

ghost commented 8 years ago

Link to the modified file

https://drive.google.com/file/d/0B7432LxrXUpAMUZZN3ZxcG5Xa2c/view?usp=sharing