tudarmstadt-lt / structured-topics

Structured Topics using Distributional Semantics
Apache License 2.0

Implement an initial prototype of the system #1

Open alexanderpanchenko opened 9 years ago

alexanderpanchenko commented 9 years ago

Motivation

We need to develop a prototype of the system that builds structured topics (creates a model) and is able to label new texts according to these topics. This prototype is supposed to provide all the minimal functionality of the system (input/output) and to be implemented with the most straightforward set of algorithms. The goal is to make an initial validation of the idea and then improve the prototype gradually. In this step we do no evaluation; that will be done later (during the "official" 6-month period reserved for writing the thesis).

Implementation

The prototype will build structured topics out of sense clusters, i.e. it will cluster word senses which are provided as input. The system will rely on existing developments of the LT team in order to efficiently compute a similarity graph of senses.

The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):

  1. Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText sense clusters:
  2. Create a sense-feature matrix out of these files in the following format:

    sense-id<TAB>cluster-word<TAB>similarity-of-sense-and-cluster-word

    for instance

    python#np#2    java#np#4    0.75

    here 0.75 is 1/(r+1)^0.33, where r is the rank of the cluster word, from 1 to n (a sketch of this conversion follows after this list)

    To be more precise, the system should generate the same output as this package: https://github.com/tudarmstadt-lt/noun-sense-induction

    This includes two additional files: sense-id<TAB>freq and cluster-word<TAB>freq.

  3. For each sense retrieve the top 200 most similar senses, i.e. calculate a similarity graph of senses. You will need to adapt this project to do so efficiently using the Apache Spark framework: https://github.com/tudarmstadt-lt/noun-sense-induction-scala. To understand how it works, read

    The framework outputs a file in the format:

    sense-id<TAB>sense-id<TAB>similarity

    For each sense it would generate the 200 nearest neighbours.

    To get started with Spark, just look at the examples: http://spark.apache.org/

  4. Cluster the graph of sense similarities using the Chinese Whispers and/or Markov Chain Clustering algorithm. Use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively, you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/

    In your case, instead of words you will cluster word senses.

    The Chinese Whispers algorithm is described in:

    http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf

    The expected output format is:

    structured-topic-id<TAB>sense-id-1,sense-id-2,sense-id-3,...
  5. Write a classification module that uses the structured topics, i.e. clusters of senses, to annotate text documents. The module should:

    • load the structured topics
    structured-topic-id<TAB>cluster-word-1,cluster-word-2,cluster-word-3,...
    • for an input document, output a set of the most relevant topics
    • each output topic should have a confidence score for the classification

    To implement this module you should use an ElasticSearch index: one topic becomes one document, and an input document is used as the search query. The retrieval system will return a list of documents (topics) ranked by their TF-IDF score. A rough sketch of this idea follows after the reference links below.

https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
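For illustration, here is a minimal Scala sketch of the conversion in step 2. The input column layout and the function name are assumptions based on the example above, not a specification of the DDT format.

// Hypothetical sketch for step 2: expand one DDT line of the form
//   word<TAB>sense-number<TAB>cluster-word-1, cluster-word-2, ...
// into sense-feature rows  sense-id<TAB>cluster-word<TAB>weight,
// where the weight of the cluster word at rank r is 1 / (r + 1)^0.33.
def senseFeatureRows(ddtLine: String): Seq[String] = {
  val Array(word, senseNumber, cluster) = ddtLine.split("\t", 3)
  val clusterWords = cluster.split(",").map(_.trim).filter(_.nonEmpty)
  clusterWords.zipWithIndex.map { case (clusterWord, i) =>
    val r = i + 1 // rank of the cluster word, from 1 to n
    val weight = 1.0 / math.pow(r + 1, 0.33)
    f"$word#$senseNumber\t$clusterWord\t$weight%.4f"
  }.toSeq
}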
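A rough sketch of the topics-as-documents idea from step 5. The index name, document type, field name and the plain-HTTP calls are assumptions for illustration only; in practice an ElasticSearch client library would be used.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Hypothetical sketch: every structured topic is indexed as one ElasticSearch
// document whose "words" field holds the cluster words of its senses; an input
// text is then sent as a match query, and the TF-IDF-ranked hits are the
// candidate topics with their scores as classification confidence.
object TopicIndex {
  val es = "http://localhost:9200" // assumed local ElasticSearch instance

  private def http(method: String, path: String, body: String): String = {
    val conn = new URL(es + path).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod(method)
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val writer = new OutputStreamWriter(conn.getOutputStream, "UTF-8")
    writer.write(body)
    writer.close()
    Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
  }

  // index one structured topic under its id, with all cluster words as one text field
  def indexTopic(topicId: String, clusterWords: Seq[String]): String =
    http("PUT", s"/topics/topic/$topicId",
      s"""{"words": "${clusterWords.mkString(" ")}"}""")

  // classify a document: returns the raw search response with scored topic hits
  def classify(documentText: String, topN: Int = 10): String =
    http("POST", s"/topics/_search?size=$topN",
      s"""{"query": {"match": {"words": "${documentText.replace('"', ' ')}"}}}""")
}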

ghost commented 9 years ago

How is the value 'r' in the formula for the sense-feature matrix calculated?

For example given

Java#NP 4   Windows#NP, Linux#NP,...

and

Windows#NP  1   Linux#NP, OS#NP,
Windows#NP  2   Console#NP,

then Java#NP#4 and Windows#NP#1 have Linux#NP in common, so the value should be > 0 (if I got it right). I would expect something like common_words_in_both_clusters / (number_of_words_in_cluster_1 + number_of_words_in_cluster_2).

alexanderpanchenko commented 9 years ago

In the original work of Chris (and as implemented in jobimtext.org), the similarity of two objects (in our case, of two senses) is simply the number of common features. As far as I remember, in https://github.com/tudarmstadt-lt/noun-sense-induction-scala the number of common features is normalized by the number of features per word (per sense in your case, i.e. 200):

Implementation is here: https://github.com/tudarmstadt-lt/noun-sense-induction-scala/blob/master/src/main/scala/WordSimUtil.scala#L232

Note that normalization is not really needed because most of the words (senses) have the same number of features.
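For illustration only (the function name and the normalization constant are assumptions, not the actual WordSimUtil code), the measure boils down to something like:

// Sketch: similarity of two senses as the number of shared cluster words
// (features), optionally normalized by the number of features per sense.
def senseSimilarity(featuresA: Set[String], featuresB: Set[String],
                    featuresPerSense: Int = 200): Double =
  (featuresA intersect featuresB).size.toDouble / featuresPerSense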

ghost commented 9 years ago

Found some problematic lines while testing: for example, line 9953 in senses-n30-1600k contains the key &amp\t#VB, which is currently interpreted as '&amp\' and '#VB'. I would spend some more logic on handling escaped CSV separators for this case.

The computation of the sense-feature matrix for senses-n30-1600k takes around 40 minutes and most of my memory at the moment. I am going to check if I can adapt it for Spark.

alexanderpanchenko commented 9 years ago

I have slightly cleaner versions of the files here (some null bytes and other weird characters were removed):

http://panchenko.me/data/joint/senses-wiki-n30-1600k.csv.gz http://panchenko.me/data/joint/senses-wiki-n200-380k.csv.gz

Feel free to use these and apply any extra filter on top of them. As to the filtering, it is better to make a standalone class/utility for the cleanup (it takes a senses file as input and outputs a cleaned version of the file). This way, other programs dealing with the senses will benefit from the cleaned version.
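A minimal sketch of such a cleanup utility, assuming a gzipped three-column senses file; the filtering rules and the expected column count are illustrative assumptions, to be adjusted to the actual file layout.

import java.io._
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

// Hypothetical standalone cleanup utility: reads a gzipped senses file,
// drops lines containing control characters or an unexpected number of
// columns, and writes a cleaned gzipped copy.
object CleanSenses {
  def main(args: Array[String]): Unit = {
    val Array(inFile, outFile) = args
    val in = Source.fromInputStream(
      new GZIPInputStream(new FileInputStream(inFile)), "UTF-8")
    val out = new PrintWriter(new OutputStreamWriter(
      new GZIPOutputStream(new FileOutputStream(outFile)), "UTF-8"))
    try {
      for (line <- in.getLines()) {
        val cleaned = line.filter(c => c == '\t' || c >= ' ') // strip null bytes and other control chars
        if (cleaned.split("\t", -1).length == 3) // assumed three-column senses format
          out.println(cleaned)
      }
    } finally {
      in.close()
      out.close()
    }
  }
}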

alexanderpanchenko commented 9 years ago

I looked through your current code. Looks good. Here are a couple of comments:

ghost commented 9 years ago

It's far from being clean code, but it works for a first start :) I want to add some API documentation and maybe a standalone jar with command-line args later for better usability.

At the moment it loads all data into memory (~4.6 GB) and then performs the calculation. My first approach was O(n^2), with a double loop over all lines, which was way too expensive. Now I am using a HashMap to look up possible clusters for a word.

For each cluster and each word in the cluster: lookup all senses for the word and calculate the similarity.

Taking n as the number of clusters, m as the maximum number of words per cluster, and s as the maximum number of senses for one word, it should run in approximately n (iterate over all clusters) * m (each word) * ~1 (look up the other senses in the hashtable) * s (for all senses) * ~m^2 (similarity matching) steps.

For senses-n30-1600k we have n = 258849, m = 200, s = 36.

I think the similarity matrix output file was something like 2.9 GB.
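A minimal Scala sketch of the lookup-based approach described above; the class and method names, and the use of the plain overlap count as the score, are illustrative assumptions, not the actual code.

// Sketch: senses indexed by their head word; for every sense, only senses of
// words that occur in its cluster are considered as candidates, and the score
// is the number of shared cluster words.
case class Sense(word: String, id: String, cluster: Set[String]) // e.g. ("python#np", "python#np#2", ...)

def similarityEdges(senses: Seq[Sense]): Iterator[(String, String, Int)] = {
  val sensesOfWord: Map[String, Seq[Sense]] = senses.groupBy(_.word) // inverted index: word -> its senses
  for {
    a <- senses.iterator
    w <- a.cluster.iterator                      // each word in a's cluster
    b <- sensesOfWord.getOrElse(w, Nil).iterator // all senses of that word
    if a.id != b.id
    overlap = (a.cluster intersect b.cluster).size
    if overlap > 0
  } yield (a.id, b.id, overlap)
}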

alexanderpanchenko commented 9 years ago

For each cluster and each word in the cluster: lookup all senses for the word and calculate the similarity.

So this is something like an inverted index of senses given a word, right?

ghost commented 9 years ago

Kind of. The idea was to use the same datatype for the sense-word and the words in the cluster.

I am currently trying to implement the same algorithm in Spark, which seems to be a matter of finding the right data structure and operations on the RDD. The major issue at the moment is the lookup from a cluster word to a sense with the same word, so e.g. for A#1 B,... I want to look up B#1, B#2, ..., B#n and their cluster words. My first try was a PairRDD of (String, Cluster), but a lookup in the RDD is not possible while using a function on it. At the moment I am testing a cartesian of the RDD with itself plus a filter, but I guess this will run forever... I am pretty sure this can be solved more elegantly with a map/join/... to do the lookup before processing the data.
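One way to avoid both the per-record lookup and the cartesian product is to turn the cluster words into join keys. This is a hypothetical Spark sketch, not the existing noun-sense-induction-scala code; the input format and all names are assumptions.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: senses that share a cluster word are joined via that
// word, and the similarity of a sense pair is the number of shared words.
object SenseSimSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sense-similarity"))

    // assumed input format: sense-id<TAB>cluster-word-1,cluster-word-2,...
    val senses = sc.textFile(args(0)).map { line =>
      val Array(senseId, words) = line.split("\t", 2)
      (senseId, words.split(",").map(_.trim).filter(_.nonEmpty))
    }

    // key every sense by each of its cluster words: (cluster-word, sense-id)
    val byWord = senses.flatMap { case (senseId, words) => words.map(w => (w, senseId)) }

    // senses sharing a cluster word become candidate pairs; count the shared words
    val edges = byWord.join(byWord)
      .filter { case (_, (a, b)) => a < b }   // drop self-pairs and duplicate directions
      .map { case (_, pair) => (pair, 1) }
      .reduceByKey(_ + _)                     // similarity = number of common cluster words
      .map { case ((a, b), n) => s"$a\t$b\t$n" }

    edges.saveAsTextFile(args(1))
    sc.stop()
  }
}

The top-200 neighbours per sense (step 3 above) could then be taken in a separate pass, e.g. by grouping the resulting edges by the first sense and keeping the highest-scoring ones.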

alexanderpanchenko commented 9 years ago

OK. As to Spark, I rather recommend using the current implementation and extending it if needed. We will still have to implement some other stuff that is not yet implemented with Spark, e.g. graph clustering, which is local for now.

alexanderpanchenko commented 9 years ago

useful commands

htop
sort -k 1,1 -k 3,3rn sim.txt > ~/sim.txt
alexanderpanchenko commented 9 years ago

Frequency dictionary:

http://panchenko.me/data/joint/word-freq-news.gz

MapReduce for similarity computation

alexanderpanchenko commented 9 years ago

Another tip on your project:

Use Gzip to store all big files in the project, and use readers/writers that can read Gzip. So instead of sim.txt use sim.txt.gz and so on. This saves a great deal of space and comes at very little CPU cost.

Editors like VIM, as well as other CLI tools, can open Gzip files without any problem.

alexanderpanchenko commented 9 years ago
~/.bashrc
alias java8='/path/to/java/bin'
alexanderpanchenko commented 9 years ago
Refrain#NP   word#NP    1
Refrain#NP#2   word#NP    1
alexanderpanchenko commented 9 years ago

vim replace

:%s/\\t/\t/gc
alexanderpanchenko commented 9 years ago
zcat == gzip -dc 
zcat sim.txt.gz | sort -k 1,1 -k 3,3rn > ~/sim-sorted.txt
gzip sim-sorted.txt
alexanderpanchenko commented 9 years ago

top 100000, sorted

head -100000 sim.txt | sort -k 3,3rn | less

number of lines in a file

wc -l sim.txt
ghost commented 8 years ago

Some output from the clustering: I used senses-wiki-n200-380k with my own similarity implementation and the cw-clustering from https://github.com/johannessimon/chinese-whispers with -n 100 -N 100.

coyote#NN#0     0       heron#NN#0      fox#NN#0:800.3779  deer#NN#0:783.01587  raccoon#NN#0:799.0669  wolf#NN#0:780.0963  squirrel#NN#0:790.4912  bear#NN#0:779.046  rabbit#NN#0:782.8652  beaver#NN#0:793.9739  bobcat#NN#0:805.51215  badger#NN#0:801.27356  otter#NN#0:785.91223  elk#NN#0:774.834  hare#NN#0:785.3174  eagle#NN#1:777.9281  owl#NN#0:773.1724  boar#NN#0:772.19446  opossum#NN#0:797.9657  weasel#NN#0:798.4339  mink#NN#0:796.2611  marten#NN#0:797.6344  lynx#NN#0:790.745  moose#NN#0:780.4941  porcupine#NN#0:780.60724  lion#NN#0:755.18774  cougar#NN#0:792.54443  armadillo#NN#0:776.35675  skunk#NN#0:791.8093  cat#NN#0:760.17334  jackal#NN#0:781.3113  alligator#NN#0:771.1817  hyena#NN#0:733.192  snake#NN#0:730.9533  bison#NN#0:751.5092  muskrat#NN#0:786.5424  grouse#NN#0:760.3753  vulture#NN#0:732.1771  rat#NN#0:729.4525  pheasant#NN#0:759.6551  rattlesnake#NN#0:770.11224  turkey#NN#0:0.99560744  turkey#NN#1:904.19653  hawk#NN#0:754.0704  monkey#NN#0:748.42035  tiger#NN#0:734.30756  goat#NN#0:0.35486722  goat#NN#1:811.03345  antelope#NN#0:737.6167  jaguar#NN#0:754.5465  quail#NN#0:753.4215  leopard#NN#0:721.5583  pig#NN#0:0.3538233  pig#NN#1:768.16534  vole#NN#0:775.45447  shrew#NN#0:771.6228  ferret#NN#0:795.57263  lizard#NN#0:688.4316  skunks#NP#0:804.708  sheep#NN#0:787.8779  turtle#NN#0:719.3788  heron#NN#0:592.68427  caribou#NN#0:766.0515  fowl#NN#0:816.80444  fowl#NN#1:0.3538233  wolverine#NN#1:862.02484  waterfowl#NN#0:743.0547  dog#NN#0:711.6682  kangaroo#NN#0:737.96594  woodpecker#NN#0:671.51154  marmot#NN#0:788.01184  hedgehog#NN#0:774.3194  chipmunk#NN#0:787.8819  jackrabbit#NN#0:794.1487  bird#NN#0:711.64075  raptor#NN#0:730.85175  reindeer#NN#0:773.11597  mongoose#NN#0:781.0858  iguana#NN#0:741.0136  puma#NN#0:793.4733  stork#NN#0:688.5844  fisher#NN#1:924.13666  duck#NN#0:675.712  pronghorn#NN#0:784.58496  partridge#NN#0:754.85583  peacock#NN#0:726.46655  gull#NN#0:523.87335  crow#NN#0:750.8598  possum#NN#0:776.09656  sloth#NN#0:741.8031  elephant#NN#0:676.96204  falcon#NN#0:711.64044  mouse#NN#0:689.9242  python#NN#0:724.5709  tapir#NN#0:745.0664  giraffe#NN#0:708.752  bat#NN#0:762.07477  crocodile#NN#0:678.89294  kingfisher#NN#0:574.69586  sparrow#NN#0:605.25476  rodent#NN#0:694.44  jay#NN#0:703.6276

gem#NN#0        0       agate#NN#0      gemstone#NN#0:262.01654  gemstone#NN#1:53.38772  diamond#NN#0:200.9922  gold#NN#0:3.9127073  gold#NN#1:212.11691  pearl#NN#0:254.55843  pearl#NN#1:49.50488  stone#NN#0:130.6642  jade#NN#0:194.87483  ruby#NN#0:204.8426  ruby#NN#1:4.0736437  silver#NN#0:4.03541  silver#NN#2:272.16336  silver#NN#3:77.30699  emerald#NN#0:4.1446347  emerald#NN#1:73.76355  emerald#NN#2:260.45782  ivory#NN#0:238.59836  ivory#NN#1:30.808867  turquoise#NN#1:17.809385  turquoise#NN#2:292.25256  sapphire#NN#0:175.97687  glass#NN#0:146.00334  marble#NN#0:190.97966  ceramic#NN#0:217.97165  ceramic#NN#1:49.043785  crystal#NN#0:143.8938  fur#NN#1:10.215885  fur#NN#2:173.11418  silk#NN#0:110.002625  textile#NN#0:88.33784  porcelain#NN#0:159.39062  enamel#NN#0:160.38957  timber#NN#0:164.23375  timber#NN#1:64.57974  wood#NN#0:119.7911  copper#NN#0:141.80215  ore#NN#0:7.4206085  ore#NN#1:123.86437  mineral#NN#0:95.66668  spice#NN#0:32.623753  goods#NN#0:72.343666  cloth#NN#0:2.9034278  cloth#NN#1:154.67143  metal#NN#1:132.3641  salt#NN#0:69.122505  salt#NN#1:22.121464  coal#NN#0:60.30179  coal#NN#1:139.80408  china#NN#0:41.841553  china#NN#1:33.318485  china#NN#2:255.91545  shell#NN#0:1.901605  shell#NN#1:0.52294  shell#NN#2:174.90643  shell#NN#3:19.632708  bronze#NN#1:167.0403  flint#NN#0:160.93593  opal#NN#0:194.30475  opal#NN#1:4.03541  leather#NN#0:3.4198134  leather#NN#1:156.11583  agate#NN#0:183.29579  feather#NN#0:49.5295  clay#NN#0:124.498375  mica#NN#0:157.49162  obsidian#NN#0:195.2422  amber#NN#0:302.4613  amber#NN#2:3.4198134  amber#NN#3:18.802755  quartz#NN#0:148.51817  plate#NN#1:93.07925  plate#NN#2:7.3387656  lazulus#NN#0:193.6699  lazulus#NN#1:4.161014  bullion#NN#0:124.525406  bullion#NN#1:66.29011  lacquer#NN#0:133.54341  iron#NN#0:125.16192  granite#NN#0:240.20383  granite#NN#1:40.354095  nugget#NN#0:98.5307  pebble#NN#0:121.920044  bone#NN#0:143.40997  bone#NN#2:17.126678  tin#NN#0:143.73773  nickel#NN#0:98.62473  tile#NN#0:91.88612  brass#NN#0:204.99788  wool#NN#0:73.34005  blue#NN#0:27.205889  blue#NN#1:27.653355  blue#NN#2:40.31206  tobacco#NN#0:35.02309  rubber#NN#0:122.90308  rubber#NN#1:67.49288  garnet#NN#0:158.07266  garnet#NN#1:4.176753  topaz#NN#0:166.12654  amethyst#NN#0:217.67023

alien#NN#2      3       saucer#NN#1     robot#NN#3:17.55077  saucer#NN#1:5.936978  saucer#NN#2:2.144787  saucer#NN#3:97.43907  spaceship#NN#2:12.148091  mecha#NN#2:13.228989  meteor#NN#2:6.9318075

How do I have to interpret the cluster labels (third column)? E.g. for the cluster

gem#NN#0        0       agate#NN#0      gemstone#NN#0:262.01654  gemstone#NN#1:53.38772  diamond#NN#0:200.9922

agate does not seem to be an appropriate label.

alexanderpanchenko commented 8 years ago

But as we discovered, your implementation seems to limit the choice of candidates to the related words, doesn't it?

What about the Spark implementation of the similarity? Did you calculate the similarity of senses with that one? It would be nice to compare the results.

As to the label, we are not interested in it for the moment.

alexanderpanchenko commented 8 years ago

Please create a bash script with the commands you used to perform the clustering, for reproducibility in the future and for easier automation (e.g. for meta-parameter optimization). Something like:

java -jar file.jar  ... -n 100 -N 100 ...
alexanderpanchenko commented 8 years ago

These results look reasonable. We will look into more of them tomorrow.

alexanderpanchenko commented 8 years ago
  1. Number of elements in the cluster
  2. Sorting by the first and the third column: sort -k 1,1 -k 3,3rn
  3. MCL: https://github.com/tudarmstadt-lt/chinese-whispers
  4. Try a binarized version of the graph (replace the weights)
  5. Using other input graphs.
  6. Different numbers of NN: 10, 25, 50, 75, 100, 200.
ghost commented 8 years ago

I started some pipelines with different settings on frink. Format of the output-folders:

<input_ddt>_<word_frequency>_<number_of_similar_senses>_<timestamp>

The ddt-news-*-closure input files are not exactly in the same format as the previous ones:

• "," instead of ", " as separator (adapted the parser)
• weights separated by ":" (adapted the parser, but this introduced some issues with words containing ":")
• cluster words containing line breaks (losing all data after the line break)
• senses without any cluster words

I will take some more time to inspect the data and will add Markov Chain Clustering in the next days.

alexanderpanchenko commented 8 years ago

OK, great. By the way, we need to postpone our meeting on Friday -- I will be travelling. Can we meet next Tuesday instead at 14:00? Please upload new clustering results, so I can still inspect them.

alexanderpanchenko commented 8 years ago

https://perso.uclouvain.be/vincent.blondel/research/louvain.html