Clustering Gigaword streams.
Setup (OS X, with Homebrew):

brew install libschwa
virtualenv -p python3.4 ve
source ve/bin/activate
pip install libschwa-python

Setup (alternative, without Homebrew):

~repos/virtualenv/virtualenv.py -p python3.4 ve
source ve/bin/activate
easy_install-3.4 /path/to/pip-1.2.1.tar.gz
pip install -e ~/repos/nltk
CC=gcc CXX=g++ ve/bin/python3.4 -m pip install libschwa-python

Finally (for either setup), install gigacluster itself in development mode:

pip install -e .
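A quick smoke test that the environment works (module names assumed from the packages installed above):

python3.4 -c 'from schwa import dr; import nltk'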
Edit ve/bin/activate to include a line like:
export READABILITY_API_TOKEN=<hash>
Change the data and cache paths to point to wherever you wish to store the data.
Change the following line in bin/gn_cron.sh to point to where you checked out the code:
cd /data1/gigacluster/gigacluster
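Since gn_cron.sh appears to be intended for cron, an entry like the following (hypothetical schedule; path taken from the checkout location above) would run it hourly:

0 * * * * /data1/gigacluster/gigacluster/bin/gn_cron.sh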
Converting Gigaword to docrep (roughly 5h 20m across 8 processors):
time find ~/schwa/corpora/raw/gigaword-en5/data/ -type f | sort | \
parallel -j 8 ./gigacluster/xml_to_dr.py --output-dir /data1/gigacluster/dr {}
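To sanity-check the conversion, count the documents in one converted file with dr-count (also used below; the file name assumes the converted files keep Gigaword's <stream>_eng_<yyyymm> naming):

cat /data1/gigacluster/dr/afp_eng_201001.dr | dr-count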
Reorganising into per-stream data directories (manually, or with a loop like the sketch after the tree below):
data/
├── afp
│   └── afp_eng_201001.dr
├── apw
│   └── apw_eng_201001.dr
├── cna
│   └── cna_eng_201001.dr
├── nyt
│   └── nyt_eng_201001.dr
├── wpb
│   └── wpb_eng_201001.dr
└── xin
    └── xin_eng_201001.dr
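An untested sketch that performs the same reorganisation, assuming the converted files are named <stream>_eng_<yyyymm>.dr as above:

for f in /data1/gigacluster/dr/*_eng_*.dr; do
    stream=$(basename "$f")   # e.g. afp_eng_201001.dr
    stream=${stream%%_*}      # e.g. afp
    mkdir -p "data/$stream"
    mv "$f" "data/$stream/"
done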
Building an IDF model:
find data/ -iname '*dr' | parallel ./gigacluster/print_df_tokens.py {} ">" {.}.df
cat data/*/*dr | dr-count > data/n
cat data/*/*df | ./gigacluster/build_idf_model.py -n $(cat data/n) > data/idf.txt
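To peek at the resulting model (format assumed: one token per line with its weight, presumably something like log(N / df)):

head data/idf.txt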
Clustering a month across all streams:
time ./gigacluster/print_clusters.py \
-t 0.4 -m SentenceBOWOverlap -l 10 \
-p data/nyt/ -s data/apw/ \
-s data/afp/ -s data/cna/ \
-s data/wpb/ -s data/xin/
This clusters documents by sentence bag-of-words overlap (SentenceBOWOverlap) between the pivot stream (-p) and each comparison stream (-s), keeping matches above the 0.4 threshold. There are also options to weight by IDF.
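The exact flag names for IDF weighting aren't shown here; assuming the script uses a standard argument parser, the full option list should be available via:

./gigacluster/print_clusters.py --help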
Generating the jobs:
for i in afp apw cna wpb nyt xin; do for y in `seq 1994 2010`; do echo $i $y; done; done > jobs.txt
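The loop writes one stream-year pair per line, so jobs.txt begins:

afp 1994
afp 1995
afp 1996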
Running via parallel:
cat jobs.txt | parallel -j 10 ./cluster_year.sh /data1/gigacluster/clustering-0.4-0.4/ {}
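GNU parallel substitutes each line of jobs.txt for {}. For long runs it can help to log and resume jobs with parallel's standard --joblog and --resume options (not specific to this repo):

cat jobs.txt | parallel -j 10 --joblog clustering.joblog --resume \
    ./cluster_year.sh /data1/gigacluster/clustering-0.4-0.4/ {}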