Closed: alexanderpanchenko closed this issue 8 years ago
Passwordless SSH setup: http://www.linuxproblem.org/art_9.html
Connect: ssh frink@lt...
Monitor processes: htop
Copy a file to the server: scp file.txt frink:
Download data: wget
Hi Alexander, to run programs on the frink computer, should I wait until it is not occupied by other running programs? For example, right now only 2 cores are free. By the way, I think I am done with the Chinese Whispers algo (and the overall setup); I will start working on the two other algos in the next few days.
Hi, indeed Frink is fairly busy now. For now it makes sense to use the 2-4 free cores. If you have problems with memory, this may be helpful. Otherwise, if everything fits into your 8Gb, just compute locally.
Hello, regarding the Louvain Method, I haven't found any Java implementation; there is only this C++ one: https://github.com/riyadparvez/louvain-method/blob/master/README.md
Do I have to rewrite everything in Java, or should I use this implementation for now and just adapt the input file?
You can use this one: https://github.com/gephi/gephi/blob/142c8b58fc05107577720602391e4f608a5f3afd/modules/StatisticsPlugin/src/main/java/org/gephi/statistics/plugin/Modularity.java
But actually right now we do not really care whether it is in C++ or Java. So use whatever is faster to get results.
Hello Alexander, for the 3rd step, "The set of hypernyms will be provided": do you have it, or do I need to use WordNet, for example, to find some? I finally used the C++ implementation for the Louvain Method; the best parameters still need to be found (for the other algos too), but it seems to work.
Right now, just use WordNet. I need to ask a colleague for automatically extracted relations, but the main difference would be the coverage.
ISA relations on frink machine: /home/panchenko/isas
Use them in addition to WordNet stuff.
Hi Alexander, so far I have:
Should we meet so that I can present you what I have done ?
what is OOM LM?
yes, we definitely should meet to discuss your progress. how about 14:00 this friday?
please post sample clustering here and upload the full version to Frink (CSV files)
Ah I forgot a period, I wanted to write "... due to OOM (out of memory). LM (Louvain Method) works..."
ok for friday at 14:00
Here are 3 clustering examples using Chinese Whispers, with the first 160,000 lines of "ddt-news-n50-485k-closure.csv", removing the words appearing fewer than 11 times in the frequency dictionary
see Google doc
Does the OOM with CW appear on your local machine? Use Frink, it has 100Gb of RAM. Feel free to use all free memory. If needed, we have a server with 256Gb of RAM.
see Google doc
Is there a better way to display them?
Yes, when I index them in ElasticSearch I remove the #NP#1 part
I tried to use Frink but it was slow (because of the other running processes, I guess), slower than on my MacBook.
Yes, it is very busy now. But if the problem is memory just launch it overnight :-)
Use the Google doc for your clusters to better visualize them (check your gmail). Use text wrapping. Strip the POS and sense ID. Add a space after each comma. Use one tab per clustering. Generate a CSV and then just import it into the Google doc (or part of it if it is huge).
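That post-processing can be sketched as follows, assuming each cluster is a comma-separated list of entries like `java#NP#2` (word, POS tag, sense ID, as in the `#NP#1` example mentioned later in this thread); the class and method names are illustrative, not from the repository:

```java
// Sketch of the cluster post-processing described above: strip the POS
// and sense-ID suffix and add a space after each comma. Entry format
// "word#POS#senseId" is an assumption based on the thread.
public class ClusterFormatter {

    // "java#NP#2" -> "java"; entries without a '#' are returned as-is.
    static String stripPosAndSense(String entry) {
        int hash = entry.indexOf('#');
        return hash >= 0 ? entry.substring(0, hash) : entry;
    }

    // Reformat one raw cluster line for the Google doc.
    static String format(String rawCluster) {
        StringBuilder sb = new StringBuilder();
        for (String entry : rawCluster.split(",")) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(stripPosAndSense(entry.trim()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("python#NP#1,java#NP#2,ruby#NN#1"));
        // python, java, ruby
    }
}
```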
I tried to run MCL on Frink (after improving the code as we discussed), I got an OOM error, it seems that we are limited to 25GB of RAM per user.
Strange, i didn't know about such limitations. Just try again. Meanwhile, I will ask colleagues about these limits.
Yes I did. When I run it and look at htop, I see the memory used by the process increasing rapidly up to 25GB, and then the process disappears from htop because it stopped with the OOM error.
I asked: frink has no limits, but check your -Xmx setting and make sure enough memory is available.
for MCL try http://www.micans.org/mcl/ or the MarkovClustering.java (not MarkovClustering2.java), the former relies on SparseMatrix
MarkovClustering2 relies on float[][] and thus is not good for the full graph
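A quick estimate shows why a dense float[n][n] matrix is not an option for the full graph: it needs roughly 4·n² bytes (ignoring per-row object overhead). Taking the ~485k senses suggested by one of the DDT filenames as an example:

```java
// Back-of-the-envelope check of why MarkovClustering2's dense float[][]
// cannot hold the full graph: 4 bytes per float entry, n*n entries.
public class DenseMatrixMemory {
    static long denseBytes(long n) {
        return 4L * n * n; // 4 bytes per float entry
    }
    public static void main(String[] args) {
        long n = 485_000L; // assumed from the "485k" in the DDT filename
        System.out.printf("~%d GB%n", denseBytes(n) / (1L << 30));
        // ~876 GB -- far beyond even the 256Gb server
    }
}
```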
Hello,
Regarding MCL, I used the SparseMatrix, it is running right now on Frink and it seems to take a lot of time, so wait & see.
As we said last week, I filtered the input file to keep NN, NP, and JJ, and I also added RB (adverbs). I reran CW and LM, but I am not sure it gives better clustering. Maybe it would be better to perform the clustering on the original file without filtering and filter by the POS we want afterwards, so that there is more information to "help" the clustering algo. What do you think?
I also ran LM with several parameters (the different "layers" in the hierarchy of communities), but here again I am not sure which is better or worse.
When I say I am not sure that it is better, I mean that I don't see it just by looking at the clusters. I think we now need the evaluation part to make sure a modification improves the results, so I did what we discussed: for a random Wikipedia article as input (only the abstract, accessed via DBpedia), I use the ElasticSearch index to output the 3 most relevant clusters. This part is working, but I am not sure how to compare the output to Wikipedia categories, because there is no mapping between Wikipedia categories and the topics we predict. Or have I misunderstood it?
Yes, long running time of MCL is expected.
Please share the results if you already have them: upload to frink and send link here and/or upload to Google doc. So I can take a look at the results.
Regarding LM -- again share the results, I need to take a look.
Let us meet next week and discuss evaluation at that point. By this time, your goal basically is to generate as many different configurations of clustering as possible and, if possible, try to identify the best one just by looking through the data. Also it is important to fix the thing with hypernyms.
Index several most prominent sets of clusters and we will look together at the query results.
What about next Wednesday at 16:00?
Ok I will upload the files, sorry.
What do you mean by "fix the thing with hypernyms"? To make them more accurate/relevant?
Ok, no problem for Wednesday at 16:00. Just note that afterwards I will be in France for one or two weeks, but still working on the thesis.
by fixing I mean changing the current logic: you need only to take into account wordnet hypernyms and isas.
so we need to skype on Wednesday?
Ok I've already changed it.
No, I will be there on Wednesday, but afterwards I go to France for one or two weeks.
OK. Please also commit latest version of your code and scripts (e.g. for launching LM) to the repository.
http://panchenko.me/data/joint/ddt-adagram-ukwac+wacky-476k-closure-v3.csv.gz
http://panchenko.me/data/joint/ddt-wiki-n200-380k-v3-closure.csv.gz
http://panchenko.me/data/joint/ddt-wiki-n30-1400k-v3-closure.csv.gz
http://panchenko.me/data/joint/ddt-news-n200-345k-closure.csv.gz
http://panchenko.me/data/joint/ddt-news-n50-485k-closure.csv.gz
better filtering:
Hello,
I am back in Darmstadt. I haven't worked as much as I wanted to last week, but so far from the TODO list I have:
I haven't tried with other input files and there are the other centrality metrics left (although I think maybe it would take too much time to compute betweenness and closeness as they need all pairs shortest paths in the cluster)
But I don't have many results yet because I get an OOM error on Frink (when using 25Gb of RAM). Can I use more than this, or can I get access to the other server with more RAM that you mentioned?
Yes, you can use more memory. Try not to exceed 50Gb for long-running jobs and 75Gb for short-running jobs. If you need more, I can provide you access to another server.
Please keep me updated, the plan is OK. Please add new results (with new columns) to the Google doc spreadsheet, so I can quickly analyse the results. The main goal is to get some metrics that would filter big general clusters and keep only mid-sized tightly semantically connected ones.
I fixed the memory problem for the number of triangles and triplets, and I am running CW + clustering coef with every input file. So far I already have 2 results, which I uploaded to the google doc. For example, CWddt-wiki-n30-1400k-v3-closureFiltBef2.txt means:
In the google doc I put, for each cluster: the size of the cluster, the number of triangles and of connected triplets, and the clustering coefficient computed from them
I haven't included the eigenvector centrality or any other centrality metrics because, as it is now, it takes too much memory (for example the matrix product for the eigenvector centrality). That's why what is on the Google doc is just raw clusters: the step before hypernyms.
I will also upload the files on Frink in the directory simondif/structured-topics/results
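The triangle and triplet counts described in this message can be sketched as follows (a minimal, illustrative Java version working on an undirected adjacency structure; the actual implementation in the repository may differ):

```java
import java.util.*;

// Sketch of the per-cluster statistics: connected triplets (paths of
// length 2) and triangles. Representation and names are illustrative.
public class TriangleStats {

    // Connected triplets: a node of degree d is the centre of d*(d-1)/2
    // paths of length 2.
    static long countTriplets(Map<Integer, Set<Integer>> adj) {
        long triplets = 0;
        for (Set<Integer> nbrs : adj.values()) {
            long d = nbrs.size();
            triplets += d * (d - 1) / 2;
        }
        return triplets;
    }

    // Triangles: for every node, count neighbour pairs that are themselves
    // connected; each triangle is found once per corner, so divide by 3.
    static long countTriangles(Map<Integer, Set<Integer>> adj) {
        long corners = 0;
        for (Set<Integer> nbrs : adj.values()) {
            for (int v : nbrs) {
                for (int w : nbrs) {
                    if (v < w && adj.get(v).contains(w)) corners++;
                }
            }
        }
        return corners / 3;
    }

    public static void main(String[] args) {
        // A triangle 0-1-2 plus a pendant node 3 attached to node 2.
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(0, new HashSet<>(Arrays.asList(1, 2)));
        adj.put(1, new HashSet<>(Arrays.asList(0, 2)));
        adj.put(2, new HashSet<>(Arrays.asList(0, 1, 3)));
        adj.put(3, new HashSet<>(Arrays.asList(2)));
        System.out.println(countTriangles(adj) + " triangle(s), "
                + countTriplets(adj) + " triplet(s)");
        // 1 triangle(s), 5 triplet(s)
    }
}
```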
Hi, great, thanks for the updates. Why does it take so much memory? Normally you only need to load one cluster at a time, which is not so big. Try more efficient implementations. These are extremely efficient ones: https://graph-tool.skewed.de/ http://igraph.org/python/doc/tutorial/tutorial.html
Ok, actually I haven't tried my implementation yet, but I guess it would take a lot of memory, as it needs matrices whose row and column sizes are the cluster size, and we have clusters with more than 10000 words. So far I am trying to focus on the quality of the clusters before the quality of the hypernyms, because I think it will be easier to handle hypernyms once we have good clusters. By the way, for the metrics, the two links refer to Python modules; do you know if there is one in Java? I think it would be more convenient for me.
In the meantime, I am adding new clustering results files to the google doc and Frink. Have you already thought about how to filter them? I mean: should we apply some "mathematical" method, or just try and see which filtering gives the best results? For example, keep clusters with sizeMin < size < sizeMax and clustercoef > clustercoefMin.
eigenvector centrality is not a "must have" now, so indeed better not to get into potentially heavy computations at this point.
> So far I am trying to focus on the quality of the clusters before the quality of the hypernyms, because I think it will be easier to handle hypernyms once we have good clusters.
i have the following observation from your results: good clusters have good hypernyms, i.e. hypernyms that are not too general and are semantically related. so maybe hypernyms can actually help a great deal with filtering as well
> By the way, for the metrics, the two links refer to Python modules; do you know if there is one in Java? I think it would be more convenient for me.
i am not aware of Java modules that are as efficient. these two are hardcore C/C++ implementations that are very memory-efficient.
> In the meantime, I am adding new clustering results files to the google doc and Frink. Have you already thought about how to filter them? I mean: should we apply some "mathematical" method, or just try and see which filtering gives the best results?
i will take a closer look next week and let you know. preferably we need to have some mathematically-grounded method that optimizes some meaningful cost function.
> i have the following observation from your results: good clusters have good hypernyms, i.e. hypernyms that are not too general and are semantically related. so maybe hypernyms can actually help a great deal with filtering as well
ok, so I will compute hypernyms as well, but right now I am running CW and LM with all the files again, because I spotted a bug in my code that caused some modifications in the clusters. It should take until tomorrow to have all the files again.
> i will take a closer look next week and let you know. preferably we need to have some mathematically-grounded method that optimizes some meaningful cost function.
I saw what you've done on the google doc with the rank and rank2 columns. I noticed that in some cases the clustering coef. is good although the cluster is pretty bad. This happens when there are few connected triplets in the cluster, so the clustering coef. value (= number of triangles / number of connected triplets) is high. Maybe we could use another coefficient instead: rather than the number of triplets actually seen in the cluster, we could use the maximum number of triplets possible in a cluster of that size. That would give us: clustering coef. 2 = number of triangles / max number of triplets (or triangles), with max number of triplets = binomial coefficient (cluster size choose 3). I tried it on the google doc (in LMddt-wiki-n30-1400k-v3-closureFiltBef1FiltAft.txt); I think it should represent cluster quality better than the normal clustering coefficient (?)
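Both scores can be sketched as follows (illustrative only; a 10-word cluster with 2 triangles and 4 connected triplets gets a high coef of 0.5, while only 2 of the C(10,3) = 120 possible triangles exist):

```java
// Sketch of the two cluster scores discussed above: the coefficient used
// so far (triangles / connected triplets) and the proposed variant that
// divides by the maximum possible number of triangles, C(n, 3).
public class ClusterCoef {
    static double coef(long triangles, long triplets) {
        return triplets == 0 ? 0.0 : (double) triangles / triplets;
    }
    static long maxTriangles(long n) {
        return n * (n - 1) * (n - 2) / 6; // binomial coefficient C(n, 3)
    }
    static double coef2(long triangles, long clusterSize) {
        long max = maxTriangles(clusterSize);
        return max == 0 ? 0.0 : (double) triangles / max;
    }
    public static void main(String[] args) {
        System.out.println(coef(2, 4));    // 0.5 -- looks dense
        System.out.println(coef2(2, 10));  // ~0.0167 -- actually sparse
    }
}
```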
regarding weighting schema (rank, rank2) this was just a very first modest attempt to rank the clusters. please feel free to add your own ranking formulas. Right now, they can be just ad-hoc combinations of factors, but later i would go for a machine learning model that would learn the coefficients.
i think two powerful factors may be:
it would be great if you can add these two factors by our next meeting.
Ok, but these two factors don't work with the ISAS, only with hypernyms from WordNet, I guess?
the first one works, the second does not
we need to have the next meeting next week, and also a meeting with the professor, preferably before the end of the year. prepare the new results and let us meet next wednesday at 17:00.
ok, no problem. the average depth is done; for the semantic similarity, should I use this method http://arxiv.org/pdf/cmp-lg/9511007.pdf or just the path length between two words?
you can actually just use the distributional thesaurus provided to you as input (it is a graph of word_i word_j similarity_ij). alternatively, yes, you can use WordNet-based similarity measures. yes, the measure of Resnik is a good one based on WordNet. check this lib or another one: https://code.google.com/p/ws4j/
do you mean the ddt files I use to get the clusters? if so, as I don't know the "sense number" for the hypernyms in this file, should I try to find a relationship between each pair of senses for two hypernyms (hyp1sense1/hyp2sense1, hyp1sense1/hyp2sense2, hyp1sense2/hyp2sense2, etc.)? but if the relationship is not direct, e.g. we have hyp1sense1-word3sense1 and word3sense1-hyp2sense2, then similarity(hyp1,hyp2) = similarity(hyp1,word3) x similarity(word3,hyp2)?
yes. right, you do not know the sense number, but you can just use the similarity of words.
assume three hypernyms x,y, z of a cluster
then we find in the ddt sim_xy, sim_xz, sim_yz and calculate 0.33 * (sim_xy + sim_xz + sim_yz)
else you can get these sim_xy ... from wordnet
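This averaging can be sketched as follows, looking the pairwise similarities up in the DDT (treating pairs missing from the DDT as similarity 0, which is an assumption; names and the map-based lookup are illustrative):

```java
import java.util.*;

// Sketch of the hypernym coherence score above: for hypernyms x, y, z of
// a cluster, average the pairwise similarities found in the DDT graph.
public class HypernymCoherence {

    // Order-independent lookup key for a word pair.
    static String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "\t" + b : b + "\t" + a;
    }

    static double avgPairwiseSim(List<String> hypernyms, Map<String, Double> ddtSim) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < hypernyms.size(); i++) {
            for (int j = i + 1; j < hypernyms.size(); j++) {
                // Missing pairs default to 0 similarity (assumption).
                sum += ddtSim.getOrDefault(key(hypernyms.get(i), hypernyms.get(j)), 0.0);
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    public static void main(String[] args) {
        Map<String, Double> sim = new HashMap<>();
        sim.put(key("city", "town"), 0.9);
        sim.put(key("city", "place"), 0.6);
        sim.put(key("town", "place"), 0.6);
        // Average of the three pairwise similarities, i.e. (0.9+0.6+0.6)/3 = 0.7
        System.out.println(avgPairwiseSim(Arrays.asList("city", "town", "place"), sim));
    }
}
```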
I implemented the semantic similarity for hypernyms as you said, i.e. coming either from the ddt files or from WordNet. I chose the Wu&Palmer measure for WordNet because it is between 0 and 1, like the measure in the ddt files. But actually I think the "between 0 and 1" criterion is not good enough: as we will later compare measures from either the ddt files or WordNet, the two measures should be the same, or at least have the same frequency distribution. Isn't that a problem for us?
I don't think 0-1 is a problem, you can re-normalize at any time. I would also try JCN or Resnik instead, they normally give slightly better results.
> But actually I think the "between 0 and 1" criterion is not good enough: as we will later compare measures from either the ddt files or WordNet, the two measures should be the same, or at least have the same frequency distribution. Isn't that a problem for us?
Not sure what you mean here.
Can you pull the results into the google doc and try to sort the clusters right there using a combination of factors, in the fashion I did with rank/rank2? In this way we will have more to discuss.
Ok I will try other measures.
> Not sure what you mean here.
If two measures (normalized between 0 and 1) don't have the same distribution, maybe for one measure 0.6 is good while for the second one 0.6 is bad. And when we compare scores, we will mix both measures, so our analysis could be a bit off because of this.
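One common way around this mismatch is to compare ranks instead of raw values, since rank normalization maps any score distribution to a uniform one. A minimal sketch (assuming distinct score values, since Arrays.binarySearch is ambiguous for duplicates; the approach itself is a suggestion, not something agreed in the thread):

```java
import java.util.*;

// Map each raw score to its normalized rank in [0, 1], so scores from
// two differently-distributed measures become comparable.
public class RankNormalize {

    static double[] rankNormalize(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // Position of the value in sorted order (assumes distinct values).
            int rank = Arrays.binarySearch(sorted, values[i]);
            out[i] = values.length == 1 ? 1.0 : (double) rank / (values.length - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rankNormalize(new double[]{0.6, 0.2, 0.9})));
        // [0.5, 0.0, 1.0]
    }
}
```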
Motivation
We need to develop a prototype of the system that builds a structured topic model and is able to label new texts according to these topics. This vertical prototype is supposed to have all the minimal functionality of the system (input/output) and be implemented with the most straightforward set of algorithms. The goal is to make an initial validation of the idea and then improve the prototype gradually. In this step we do no full evaluation; that will be done later (during the "official" 6-month period reserved for writing the thesis).
An important point, however, is to make a preliminary evaluation of the prototype (to show that the quality is measurable).
Implementation
The prototype will build structured topics out of sense similarity graphs. These graphs were built automatically using distributional semantics methods (http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/distributional-semantics/).
The overall pipeline of the prototype (to be implemented in Java/Scala or a mix of both):
Download the data -- a Disambiguated Distributional Thesaurus (DDT) built from the JoBimText and AdaGram models:
Frequency dictionary: http://panchenko.me/data/joint/word-freq-news.gz -- to filter the graph.
The data have the format
word cid prob cluster isas
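A parser sketch for rows in this format; the delimiters (tabs between the columns, commas within the cluster and isas fields) and the `word#POS#senseId` entry format are assumptions to be checked against the actual files:

```java
// Minimal parser sketch for one DDT row "word cid prob cluster isas".
// Column and entry delimiters are assumptions, not from the spec above.
public class DdtRow {
    final String word;
    final int cid;          // sense (cluster) id of the word
    final double prob;
    final String[] cluster; // related sense entries forming this word sense
    final String[] isas;    // hypernym candidates

    DdtRow(String line) {
        String[] cols = line.split("\t", -1); // keep trailing empty columns
        word = cols[0];
        cid = Integer.parseInt(cols[1]);
        prob = Double.parseDouble(cols[2]);
        cluster = cols[3].isEmpty() ? new String[0] : cols[3].split(",");
        isas = cols[4].isEmpty() ? new String[0] : cols[4].split(",");
    }

    public static void main(String[] args) {
        DdtRow r = new DdtRow("python\t1\t0.8\tjava#NP#2,ruby#NN#1\tlanguage,snake");
        System.out.println(r.word + " sense " + r.cid + ": " + r.cluster.length
                + " cluster entries, " + r.isas.length + " isas");
        // python sense 1: 2 cluster entries, 2 isas
    }
}
```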
Cluster the graphs of sense similarities using Chinese Whispers (CW), Markov Chain Clustering (MCL) and the Louvain Method (LM).
For the first two algorithms use this implementation: https://github.com/johannessimon/chinese-whispers. Alternatively you can use this implementation: http://maggie.lt.informatik.tu-darmstadt.de/jobimtext/documentation/sense-clustering/. Use: CW: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/de/tudarmstadt/lt/cw/CW.java
MCL: https://github.com/johannessimon/chinese-whispers/blob/master/src/main/java/net/sf/javaml/clustering/mcl/MarkovClustering.java
For the LM use any available implementation e.g. https://perso.uclouvain.be/vincent.blondel/research/louvain.html.
Description of CW is available here: http://wortschatz.uni-leipzig.de/~cbiemann/pub/2006/BiemannTextGraph06.pdf
The output of clustering shall look like this:
To make each topic more readable, assign 3 frequent hypernyms to the senses (topic-labels). The set of hypernyms will be provided. In addition, for each topic label, find the URL of the image that depicts it from DBpedia (for instance http://dbpedia.org/page/Berlin). The images are located in the field: topic-label-image-urls. Each word, in case of ambiguity (http://dbpedia.org/page/Python), should be disambiguated. The output shall look like this:
5. Make a basic classification module that would use the structured topics, being clusters of senses, to annotate text documents. The module should:
load the structured topics
give each output topic a confidence of the classification
To implement this module you should use an ElasticSearch index: one topic would be one document, and then use an input document as the search query. The retrieval system will return a list of documents (topics) ranked according to their TF-IDF score.
How scoring of ElasticSearch works:
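As an illustration of the idea (not ElasticSearch's exact formula, which is Lucene's TF-IDF-based practical scoring with additional normalization factors), a toy scorer that treats each topic as a bag of words and ranks topics against an input document might look like this:

```java
import java.util.*;

// Toy illustration of the classification module: score an input document
// against every topic with a simple TF-IDF dot product. Rare tokens
// (appearing in few topics) contribute more via their idf weight.
public class TopicScorer {

    static Map<String, Double> score(List<Set<String>> topics, List<String> docTokens) {
        int n = topics.size();
        Map<String, Double> result = new LinkedHashMap<>();
        for (int t = 0; t < n; t++) {
            double s = 0;
            for (String tok : docTokens) {
                if (!topics.get(t).contains(tok)) continue;
                int df = 0; // number of topics containing the token
                for (Set<String> topic : topics) if (topic.contains(tok)) df++;
                s += Math.log((double) n / df); // idf; tf is 1 per occurrence
            }
            result.put("topic" + t, s);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> topics = Arrays.asList(
            new HashSet<>(Arrays.asList("python", "java", "ruby")),
            new HashSet<>(Arrays.asList("python", "snake", "reptile")));
        // "python" occurs in both topics (idf = 0), "snake" only in topic1,
        // so the document is assigned to topic1.
        System.out.println(score(topics, Arrays.asList("snake", "python")));
    }
}
```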