Closed: alexanderpanchenko closed this issue 8 years ago
Passwordless SSH setup: http://www.linuxproblem.org/art_9.html
Connect: ssh frink@lt...
Monitor processes: htop
Copy a file to the server: scp file.txt frink:
Download data: wget
Hi Alexander, to run programs on the frink computer, should I wait until it is not occupied by other running programs? For example, right now only 2 cores are free. By the way, I think I am done with the Chinese Whispers algo (and the overall setup); I will start working on the two other algos in the next few days.
Hi, indeed Frink is fairly busy now. For now it makes sense to use the 2-4 free cores. If you have problems with memory, this may be helpful. Otherwise, if everything fits into your 8Gb, just compute locally.
Hello, regarding the Louvain Method, I haven't found any Java implementation; there is only this C++ one: https://github.com/riyadparvez/louvain-method/blob/master/README.md
Do I have to rewrite everything in Java, or should I use this implementation for now and just adapt the input file?
You can use this one: https://github.com/gephi/gephi/blob/142c8b58fc05107577720602391e4f608a5f3afd/modules/StatisticsPlugin/src/main/java/org/gephi/statistics/plugin/Modularity.java
But actually right now we do not really care whether it is in C++ or Java. So use whatever is faster to get results.
Hello Alexander, for the 3rd step, "The set of hypernyms will be provided": do you have it, or do I need to use WordNet, for example, to find some? I finally used the C++ implementation for the Louvain Method; the best parameters still need to be found (for the other algos too), but it seems to work.
Right now, just use WordNet. I need to ask a colleague for automatically extracted relations, but the main difference would be the coverage.
ISA relations on frink machine: /home/panchenko/isas
Use them in addition to WordNet stuff.
Hi Alexander, so far I have:
Should we meet so that I can present you what I have done ?
what is OOM LM?
yes, we definitely should meet to discuss your progress. how about 14:00 this friday?
please post sample clustering here and upload the full version to Frink (CSV files)
Ah I forgot a period, I wanted to write "... due to OOM (out of memory). LM (Louvain Method) works..."
ok for friday at 14:00
Here are 3 clustering examples using Chinese Whispers, with the first 160,000 lines of "ddt-news-n50-485k-closure.csv", removing the words appearing fewer than 11 times in the frequency dictionary
see Google doc
Does the OOM with CW appear on your local machine? Use Frink, it has 100Gb of RAM. Feel free to use all free memory. If needed, we have a server with 256Gb of RAM.
see Google doc
Is there a better way to display them?
Yes, when I index them in ElasticSearch I remove the #NP#1 part
I tried to use Frink but it was slow (because of the other running processes, I guess), slower than on my MacBook.
Yes, it is very busy now. But if the problem is memory just launch it overnight :-)
Use the Google doc for your clusters to better visualize them (check your gmail). Use text wrapping. Strip the POS and sense ID. Add a space after each comma. Use one tab per clustering. Generate a CSV and then just import it into the Google doc (or part of it if it is huge).
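That post-processing can be sketched as follows, assuming each cluster is a comma-separated list of entries like `java#NP#2` (word, POS tag, sense ID, as in the `#NP#1` example mentioned later in this thread); the class and method names are illustrative, not from the repository:

```java
// Sketch of the cluster post-processing described above: strip the POS
// and sense-ID suffix and add a space after each comma. Entry format
// "word#POS#senseId" is an assumption based on the thread.
public class ClusterFormatter {

    // "java#NP#2" -> "java"; entries without a '#' are returned as-is.
    static String stripPosAndSense(String entry) {
        int hash = entry.indexOf('#');
        return hash >= 0 ? entry.substring(0, hash) : entry;
    }

    // Reformat one raw cluster line for the Google doc.
    static String format(String rawCluster) {
        StringBuilder sb = new StringBuilder();
        for (String entry : rawCluster.split(",")) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(stripPosAndSense(entry.trim()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("python#NP#1,java#NP#2,ruby#NN#1"));
        // python, java, ruby
    }
}
```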
I tried to run MCL on Frink (after improving the code as we discussed), I got an OOM error, it seems that we are limited to 25GB of RAM per user.
Strange, i didn't know about such limitations. Just try again. Meanwhile, I will ask colleagues about these limits.
Yes I did. When I run it and look at htop, I see the memory used by the process increasing rapidly up to 25GB, and then the process disappears from htop because it stopped with the OOM error.
I asked: frink has no limits, but check your -Xmx setting and make sure enough memory is available.
for MCL try http://www.micans.org/mcl/ or the MarkovClustering.java (not MarkovClustering2.java), the former relies on SparseMatrix
MarkovClustering2 relies on float[][] and thus is not good for the full graph
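A quick estimate shows why a dense float[n][n] matrix is not an option for the full graph: it needs roughly 4·n² bytes (ignoring per-row object overhead). Taking the ~485k senses suggested by one of the DDT filenames as an example:

```java
// Back-of-the-envelope check of why MarkovClustering2's dense float[][]
// cannot hold the full graph: 4 bytes per float entry, n*n entries.
public class DenseMatrixMemory {
    static long denseBytes(long n) {
        return 4L * n * n; // 4 bytes per float entry
    }
    public static void main(String[] args) {
        long n = 485_000L; // assumed from the "485k" in the DDT filename
        System.out.printf("~%d GB%n", denseBytes(n) / (1L << 30));
        // ~876 GB -- far beyond even the 256Gb server
    }
}
```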
Hello,
Regarding MCL, I used the SparseMatrix, it is running right now on Frink and it seems to take a lot of time, so wait & see.
As we said last week, I filtered the input file to keep NN, NP, and JJ, and I also added RB (adverbs). I reran CW and LM, but I am not sure it gives better clustering. Maybe it would be better to perform the clustering on the original file without filtering and filter by the POS we want afterwards, so that there is more information to "help" the clustering algo. What do you think?
I also ran LM with several parameters (the different "layers" in the hierarchy of communities), but here again I am not sure which is better or worse.
When I say I am not sure that it is better, I mean that I don't see it just by looking at the clusters. I think we now need the evaluation part to make sure a modification improves the results, so I did what we discussed: for a random Wikipedia article as input (only the abstract, accessed via DBpedia), I use the ElasticSearch index to output the 3 most relevant clusters. This part is working, but I am not sure how to compare the output to Wikipedia categories, because there is no mapping between Wikipedia categories and the topics we predict. Or have I misunderstood it?
Yes, long running time of MCL is expected.
Please share the results if you already have them: upload to frink and send link here and/or upload to Google doc. So I can take a look at the results.
Regarding LM -- again share the results, I need to take a look.
Let us meet next week and discuss evaluation at that point. By this time, your goal basically is to generate as many different configurations of clustering as possible and, if possible, try to identify the best one just by looking through the data. Also it is important to fix the thing with hypernyms.
Index several most prominent sets of clusters and we will look together at the query results.
What about next Wednesday at 16:00?
Ok I will upload the files, sorry.
What do you mean by "fix the thing with hypernyms"? To make them more accurate/relevant?
Ok, no problem for Wednesday at 16:00. Just note that afterwards I will be in France for one or two weeks, but still working on the thesis.
by fixing I mean changing the current logic: you need only to take into account wordnet hypernyms and isas.
so we need to skype on Wednesday?
Ok I've already changed it.
No, I will be there on Wednesday, but afterwards I go to France for one or two weeks.
OK. Please also commit latest version of your code and scripts (e.g. for launching LM) to the repository.
http://panchenko.me/data/joint/ddt-adagram-ukwac+wacky-476k-closure-v3.csv.gz
http://panchenko.me/data/joint/ddt-wiki-n200-380k-v3-closure.csv.gz
http://panchenko.me/data/joint/ddt-wiki-n30-1400k-v3-closure.csv.gz
http://panchenko.me/data/joint/ddt-news-n200-345k-closure.csv.gz
http://panchenko.me/data/joint/ddt-news-n50-485k-closure.csv.gz
better filtering:
Hello,
I am back in Darmstadt. I haven't worked as much as I wanted to last week, but so far from the TODO list I have:
I haven't tried with other input files and there are the other centrality metrics left (although I think maybe it would take too much time to compute betweenness and closeness as they need all pairs shortest paths in the cluster)
But I don't have many results yet because I get an OOM error on Frink (when using 25Gb of RAM). Can I use more than this, or can I get access to the other server with more RAM that you mentioned?
Yes, you can use more memory. Try not to exceed 50Gb for long-running jobs and 75Gb for short-running jobs. If you need more, I can provide you access to another server.
Please keep me updated, the plan is OK. Please add new results (with new columns) to the Google doc spreadsheet, so I can quickly analyse the results. The main goal is to get some metrics that would filter big general clusters and keep only mid-sized tightly semantically connected ones.
I fixed the memory problem for the number of triangles and triplets, and I am running CW + clustering coef with every input file. So far I already have 2 results, which I uploaded to the google doc. For example, CWddt-wiki-n30-1400k-v3-closureFiltBef2.txt means:
In the google doc I put, for each cluster: the size of the cluster, the number of triangles and of connected triplets, and the clustering coefficient computed from them
I haven't included the eigenvector centrality or any other centrality metrics because, as it is now, it takes too much memory (for example the matrix product for the eigenvector centrality). That's why what is on the Google doc is just raw clusters: the step before hypernyms.
I will also upload the files on Frink in the directory simondif/structured-topics/results
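The triangle and triplet counts described in this message can be sketched as follows (a minimal, illustrative Java version working on an undirected adjacency structure; the actual implementation in the repository may differ):

```java
import java.util.*;

// Sketch of the per-cluster statistics: connected triplets (paths of
// length 2) and triangles. Representation and names are illustrative.
public class TriangleStats {

    // Connected triplets: a node of degree d is the centre of d*(d-1)/2
    // paths of length 2.
    static long countTriplets(Map<Integer, Set<Integer>> adj) {
        long triplets = 0;
        for (Set<Integer> nbrs : adj.values()) {
            long d = nbrs.size();
            triplets += d * (d - 1) / 2;
        }
        return triplets;
    }

    // Triangles: for every node, count neighbour pairs that are themselves
    // connected; each triangle is found once per corner, so divide by 3.
    static long countTriangles(Map<Integer, Set<Integer>> adj) {
        long corners = 0;
        for (Set<Integer> nbrs : adj.values()) {
            for (int v : nbrs) {
                for (int w : nbrs) {
                    if (v < w && adj.get(v).contains(w)) corners++;
                }
            }
        }
        return corners / 3;
    }

    public static void main(String[] args) {
        // A triangle 0-1-2 plus a pendant node 3 attached to node 2.
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(0, new HashSet<>(Arrays.asList(1, 2)));
        adj.put(1, new HashSet<>(Arrays.asList(0, 2)));
        adj.put(2, new HashSet<>(Arrays.asList(0, 1, 3)));
        adj.put(3, new HashSet<>(Arrays.asList(2)));
        System.out.println(countTriangles(adj) + " triangle(s), "
                + countTriplets(adj) + " triplet(s)");
        // 1 triangle(s), 5 triplet(s)
    }
}
```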
Hi, great, thanks for the updates. Why does it take so much memory? Normally you only need to load one cluster at a time, which is not so big. Try more efficient implementations. These are extremely efficient ones: https://graph-tool.skewed.de/ http://igraph.org/python/doc/tutorial/tutorial.html
Ok, actually I haven't tried my implementation yet, but I guess it would take a lot of memory, as it needs matrices whose row and column sizes are the cluster size, and we have clusters with more than 10000 words. So far I am trying to focus on the quality of the clusters before the quality of the hypernyms, because I think it will be easier to handle hypernyms once we have good clusters. By the way, for the metrics, the two links refer to Python modules; do you know if there is one in Java? I think it would be more convenient for me.
In the meantime, I am adding new clustering results files to the google doc and Frink. Have you already thought about how to filter them? I mean: should we apply some "mathematical" method, or just try and see which filtering gives the best results? For example, keep clusters with sizeMin < size < sizeMax and clustercoef > clustercoefMin.
eigenvector centrality is not a "must have" now, so indeed better not to get into potentially heavy computations at this point.
> So far I am trying to focus on the quality of the clusters before the quality of the hypernyms, because I think it will be easier to handle hypernyms once we have good clusters.
i have the following observation from your results: good clusters have good hypernyms, i.e. hypernyms that are not too general and are semantically related. so maybe hypernyms can actually help a great deal with filtering as well
> By the way, for the metrics, the two links refer to Python modules; do you know if there is one in Java? I think it would be more convenient for me.
i am not aware of Java modules that are as efficient. these two are hardcore C/C++ implementations that are very memory-efficient.
> In the meantime, I am adding new clustering results files to the google doc and Frink. Have you already thought about how to filter them? I mean: should we apply some "mathematical" method, or just try and see which filtering gives the best results?
i will take a closer look next week and let you know. preferably we need to have some mathematically-grounded method that optimizes some meaningful cost function.
> i have the following observation from your results: good clusters have good hypernyms, i.e. hypernyms that are not too general and are semantically related. so maybe hypernyms can actually help a great deal with filtering as well
ok, so I will compute hypernyms as well, but right now I am running CW and LM with all the files again, because I spotted a bug in my code that caused some modifications in the clusters. It should take until tomorrow to have all the files again.
> i will take a closer look next week and let you know. preferably we need to have some mathematically-grounded method that optimizes some meaningful cost function.
I saw what you've done on the google doc with the rank and rank2 columns. I noticed that in some cases the clustering coef. is good although the cluster is pretty bad. This happens when there are few connected triplets in the cluster, so the clustering coef. value (= number of triangles / number of connected triplets) is high. Maybe we could use another coefficient instead: rather than the number of triplets actually seen in the cluster, we could use the maximum number of triplets possible in a cluster of that size. That would give us: clustering coef. 2 = number of triangles / max number of triplets (or triangles), with max number of triplets = binomial coefficient (cluster size choose 3). I tried it on the google doc (in LMddt-wiki-n30-1400k-v3-closureFiltBef1FiltAft.txt); I think it should represent cluster quality better than the normal clustering coefficient (?)
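Both scores can be sketched as follows (illustrative only; a 10-word cluster with 2 triangles and 4 connected triplets gets a high coef of 0.5, while only 2 of the C(10,3) = 120 possible triangles exist):

```java
// Sketch of the two cluster scores discussed above: the coefficient used
// so far (triangles / connected triplets) and the proposed variant that
// divides by the maximum possible number of triangles, C(n, 3).
public class ClusterCoef {
    static double coef(long triangles, long triplets) {
        return triplets == 0 ? 0.0 : (double) triangles / triplets;
    }
    static long maxTriangles(long n) {
        return n * (n - 1) * (n - 2) / 6; // binomial coefficient C(n, 3)
    }
    static double coef2(long triangles, long clusterSize) {
        long max = maxTriangles(clusterSize);
        return max == 0 ? 0.0 : (double) triangles / max;
    }
    public static void main(String[] args) {
        System.out.println(coef(2, 4));    // 0.5 -- looks dense
        System.out.println(coef2(2, 10));  // ~0.0167 -- actually sparse
    }
}
```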
regarding weighting schema (rank, rank2) this was just a very first modest attempt to rank the clusters. please feel free to add your own ranking formulas. Right now, they can be just ad-hoc combinations of factors, but later i would go for a machine learning model that would learn the coefficients.
i think two powerful factors may be:
it would be great if you can add these two factors by our next meeting.
Ok, but these two factors don't work with the ISAS, only with hypernyms from WordNet, I guess?
the first one works, the second does not
we need to have the next meeting next week, and also a meeting with the professor, preferably before the end of the year. prepare the new results and let us meet next wednesday at 17:00.
ok, no problem. the average depth is done; for the semantic similarity, should I use this method http://arxiv.org/pdf/cmp-lg/9511007.pdf or just the path length between two words?
you can actually just use the distributional thesaurus provided to you as input (it is a graph of word_i word_j similarity_ij). alternatively, yes, you can use WordNet-based similarity measures. yes, the measure of Resnik is a good one based on WordNet. check this lib or another one: https://code.google.com/p/ws4j/
do you mean the ddt files I use to get the clusters? if so, as I don't know the "sense number" for the hypernyms in this file, should I try to find a relationship between each pair of senses for two hypernyms (hyp1sense1/hyp2sense1, hyp1sense1/hyp2sense2, hyp1sense2/hyp2sense2, etc.)? but if the relationship is not direct, e.g. we have hyp1sense1-word3sense1 and word3sense1-hyp2sense2, then similarity(hyp1,hyp2) = similarity(hyp1,word3) x similarity(word3,hyp2)?
yes. right, you do not know the sense number, but you can just use the similarity of words.
assume three hypernyms x,y, z of a cluster
then we find in the ddt sim_xy, sim_xz, sim_yz and calculate 0.33 * (sim_xy + sim_xz + sim_yz)
else you can get these sim_xy ... from wordnet
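This averaging can be sketched as follows, looking the pairwise similarities up in the DDT (treating pairs missing from the DDT as similarity 0, which is an assumption; names and the map-based lookup are illustrative):

```java
import java.util.*;

// Sketch of the hypernym coherence score above: for hypernyms x, y, z of
// a cluster, average the pairwise similarities found in the DDT graph.
public class HypernymCoherence {

    // Order-independent lookup key for a word pair.
    static String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
    }

    static double avgPairwiseSim(List<String> hypernyms, Map<String, Double> ddtSim) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < hypernyms.size(); i++) {
            for (int j = i + 1; j < hypernyms.size(); j++) {
                // Missing pairs default to 0 similarity (assumption).
                sum += ddtSim.getOrDefault(key(hypernyms.get(i), hypernyms.get(j)), 0.0);
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    public static void main(String[] args) {
        Map<String, Double> sim = new HashMap<>();
        sim.put(key("city", "town"), 0.9);
        sim.put(key("city", "place"), 0.6);
        sim.put(key("town", "place"), 0.6);
        // Average of the three pairwise similarities, i.e. (0.9+0.6+0.6)/3 = 0.7
        System.out.println(avgPairwiseSim(Arrays.asList("city", "town", "place"), sim));
    }
}
```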
I implemented the semantic similarity for hypernyms as you said, i.e. coming either from the ddt files or from WordNet. I chose the Wu&Palmer measure for WordNet because it is between 0 and 1, like the measure in the ddt files. But actually I think the "between 0 and 1" criterion is not good enough: as we will later compare measures from either the ddt files or WordNet, the two measures should be the same, or at least have the same frequency distribution. Isn't that a problem for us?
I don't think 0-1 is a problem, you can re-normalize at any time. I would also try JCN or Resnik instead, they normally give slightly better results.
> But actually I think the "between 0 and 1" criterion is not good enough: as we will later compare measures from either the ddt files or WordNet, the two measures should be the same, or at least have the same frequency distribution. Isn't that a problem for us?
Not sure what you mean here.
Can you pull the results into the google doc and try to sort the clusters right there using a combination of factors, in the fashion I did with rank/rank2? In this way we will have more to discuss.
Ok I will try other measures.
> Not sure what you mean here.
If two measures (normalized between 0 and 1) don't have the same distribution, maybe for one measure 0.6 is good while for the second one 0.6 is bad. And when we compare scores, we will mix both measures, so our analysis could be a bit off because of this.
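One common way around this mismatch is to compare ranks instead of raw values, since rank normalization maps any score distribution to a uniform one. A minimal sketch (assuming distinct score values, since Arrays.binarySearch is ambiguous for duplicates; the approach itself is a suggestion, not something agreed in the thread):

```java
import java.util.*;

// Map each raw score to its normalized rank in [0, 1], so scores from
// two differently-distributed measures become comparable.
public class RankNormalize {

    static double[] rankNormalize(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Position of the value in sorted order (assumes distinct values).
            int rank = Arrays.binarySearch(sorted, values[i]);
            out[i] = values.length == 1 ? 1.0 : (double) rank / (values.length - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rankNormalize(new double[]{0.6, 0.2, 0.9})));
        // [0.5, 0.0, 1.0]
    }
}
```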
Motivation
We need to develop a prototype of the system that builds a structured topic model and is able to label new texts according to these topics. This vertical prototype is supposed to have all the minimal functionality of the system (input/output) and be implemented with the most straightforward set of algorithms. The goal is to make an initial validation of the idea and then improve the prototype gradually. In this step we do no full evaluation; that will be done later (during the "official" 6-month period reserved for writing the thesis).
An important point, however, is to make a preliminary evaluation of the prototype (to show that the quality is measurable).
Implementation
The prototype will build structured topics out of sense similarity graphs. These graphs were built automatically using distributional semantics methods (http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/distributional-semantics/).
The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):
Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText and AdaGram models:
Frequency dictionary: http://panchenko.me/data/joint/word-freq-news.gz -- to filter the graph.
The data have the format
word cid prob cluster isas
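A parser sketch for rows in this format; the delimiters (tabs between the columns, commas within the cluster and isas fields) and the `word#POS#senseId` entry format are assumptions to be checked against the actual files:

```java
// Minimal parser sketch for one DDT row "word cid prob cluster isas".
// Column and entry delimiters are assumptions, not from the spec above.
public class DdtRow {
    final String word;
    final int cid;          // sense (cluster) id of the word
    final double prob;
    final String[] cluster; // related sense entries forming this word sense
    final String[] isas;    // hypernym candidates

    DdtRow(String line) {
        String[] cols = line.split("\t", -1); // keep trailing empty columns
        word = cols[0];
        cid = Integer.parseInt(cols[1]);
        prob = Double.parseDouble(cols[2]);
        cluster = cols[3].isEmpty() ? new String[0] : cols[3].split(",");
        isas = cols[4].isEmpty() ? new String[0] : cols[4].split(",");
    }

    public static void main(String[] args) {
        DdtRow r = new DdtRow("python\t1\t0.8\tjava#NP#2,ruby#NN#1\tlanguage,snake");
        System.out.println(r.word + " sense " + r.cid + ": " + r.cluster.length
                + " cluster entries, " + r.isas.length + " isas");
        // python sense 1: 2 cluster entries, 2 isas
    }
}
```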
Cluster the graphs of sense similarities using Chinese Whispers (CW), Markov Chain Clustering (MCL) and the Louvain Method (LM).
For the first two algorithms use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/. Use: CW: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/de/tudarmstadt/lt/cw/CW.java
MCL: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/net/sf/javaml/clustering/mcl/MarkovClustering.java
For the LM use any available implementation e.g. https://perso.uclouvain.be/vincent.blondel/research/louvain.html.
Description of CW is available here: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
The output of clustering shall look like this:
To make each topic more readable, assign 3 frequent hypernyms to the senses (topic-labels). The set of hypernyms will be provided. In addition, for each topic label, find the URL of the image that depicts it from DBpedia (for instance http://dbpedia.org/page/Berlin). The images are located in the field: topic-label-image-urls. Each word, in case of ambiguity (http://dbpedia.org/page/Python), should be disambiguated. The output shall look like this:
5. Make a basic classification module that would use the structured topics, being clusters of senses, to annotate text documents. The module should:
load the structured topics
give each output topic a confidence of the classification
To implement this module you should use an ElasticSearch index: one topic would be one document, and then use an input document as the search query. The retrieval system will return a list of documents (topics) ranked according to their TF-IDF score.
How scoring of ElasticSearch works:
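As an illustration of the idea (not ElasticSearch's exact formula, which is Lucene's TF-IDF-based practical scoring with additional normalization factors), a toy scorer that treats each topic as a bag of words and ranks topics against an input document might look like this:

```java
import java.util.*;

// Toy illustration of the classification module: score an input document
// against every topic with a simple TF-IDF dot product. Rare tokens
// (appearing in few topics) contribute more via their idf weight.
public class TopicScorer {

    static Map<String, Double> score(List<Set<String>> topics, List<String> docTokens) {
        int n = topics.size();
        Map<String, Double> result = new LinkedHashMap<>();
        for (int t = 0; t < n; t++) {
            double s = 0;
            for (String tok : docTokens) {
                if (!topics.get(t).contains(tok)) continue;
                int df = 0; // number of topics containing the token
                for (Set<String> topic : topics) if (topic.contains(tok)) df++;
                s += Math.log((double) n / df); // idf; tf is 1 per occurrence
            }
            result.put("topic" + t, s);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> topics = Arrays.asList(
            new HashSet<>(Arrays.asList("python", "java", "ruby")),
            new HashSet<>(Arrays.asList("python", "snake", "reptile")));
        // "python" occurs in both topics (idf = 0), "snake" only in topic1,
        // so the document is assigned to topic1.
        System.out.println(score(topics, Arrays.asList("snake", "python")));
    }
}
```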