smndf / structured-topics


Evaluation of the initial results #2

Closed alexanderpanchenko closed 8 years ago

alexanderpanchenko commented 8 years ago

Motivation

Evaluate the first results so you are able to write the first report.

Implementation

  1. Select 6 best configurations from the table:
    • LM + adagram
    • LM + ddt-wiki
    • LM + ddt-news
    • CW + adagram
    • CW + ddt-wiki
    • CW + ddt-news
  2. For each of these 6 clusterings, add an additional column that ranks the clusters according to their quality. Introduce an ad-hoc ranking, e.g. average-depth-of-hypernyms * average-similarity-of-hypernyms. Add this as an Excel formula and rank the clusters according to it.
  3. Add an additional column "Interpretable" to each of these 6 tables.
  4. Fill the column for each row with 1 if the cluster is "interpretable", i.e. a list of cities, a list of drugs, a list of dinosaurs. Otherwise, for uninterpretable clusters, write 0. All rows of all 6 sheets shall be annotated.
  5. Draw the Precision@k plot: x is the number of relevant clusters among the first k clusters; y is k, i.e. the number of considered clusters (see the sketch after this list). See http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
  6. Post the plots here by 15 December 2015; the earlier the better.
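For reference, a minimal sketch of how the Precision@k values could be computed from the ranked "Interpretable" annotations (a hypothetical standalone helper, not part of the existing code; precision@k here is the number of interpretable clusters among the top k divided by k, as in the IR-book):

```java
import java.util.List;

// Hypothetical helper: given the "Interpretable" labels (1/0) of the clusters in
// ranked order, compute Precision@k = (#interpretable clusters in top k) / k.
public class PrecisionAtK {

    public static double[] precisionAtK(List<Integer> rankedLabels) {
        double[] p = new double[rankedLabels.size()];
        int relevant = 0;
        for (int k = 1; k <= rankedLabels.size(); k++) {
            relevant += rankedLabels.get(k - 1);
            p[k - 1] = (double) relevant / k;
        }
        return p;
    }

    public static void main(String[] args) {
        // toy ranking: the top clusters are mostly interpretable, later ones less so
        double[] p = precisionAtK(List.of(1, 1, 0, 1, 0, 0));
        for (int k = 0; k < p.length; k++) {
            System.out.printf("P@%d = %.2f%n", k + 1, p[k]);
        }
    }
}
```
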
smndf commented 8 years ago

Ok, I guess it will give a better overview with these 6 best configurations than with 2 or 3. Regarding step 2, I understood we would first select the best clusters according to this ranking before deciding which ones are interpretable. But actually it shouldn't be much extra work, since once sorted, most bad clusters will likely get a 0. So I think it's fine; I will do this as soon as possible.

alexanderpanchenko commented 8 years ago

Ok, I guess it will give a better overview with these 6 best configurations than with 2 or 3.

right, and each has about 100-300 clusters, so it shouldn't be too much work. also, this indeed gives a better idea about the different parameters. you can take any wiki/news ddt here.

Regarding step 2, I understood we would first select the best clusters according to this ranking before deciding which ones are interpretable. But actually it shouldn't be much extra work, since once sorted, most bad clusters will likely get a 0.

right, ranking helps, but please mind that it is very important to inspect each cluster to get correct numbers

smndf commented 8 years ago

I couldn't work before today, so so far I have processed only two files. I am doing the third one, but I just noticed that there are some words with the first letter missing; I checked, and they are in the original file as well (ddt-adagram-ukwac+wacky-476k-closure-v3.csv).

Like this : 311844:ontrol#NOUN 1 0.997 dvanced#VERB#1:0.861742,omputer#NOUN#1:0.860782,ystem#NOUN#1:0.857831,anguage#NOUN#1:0.85202,ight#NOUN#1:0.835521,echnology#NOUN#1:0.833527,igital#ADJ#1:0.826269,perating#VERB#1:0.825774,esearch#NOUN#1:0.821001...

When I checked the original file, I also noticed some words with an 'a' added at the end: 475453:zonea#NOUN 1 0.999 grounda#NOUN#1:0.623979,flighta#NOUN#1:0.613561,statea#NOUN#1:0.59931,actiona#NOUN#1:0.590598,conflicta#N...

but there seem to be no words with a 'b' added, for instance

alexanderpanchenko commented 8 years ago

thanks for pointing this out. just proceed despite these errors. check also if the versions with the first letter are present, e.g. "control"

smndf commented 8 years ago

Yes the normal words are there as well (and more often).

smndf commented 8 years ago

I am tagging the clusters, and for many of them I can't tell whether they are interpretable or not (because I don't know the topic well enough, e.g. baseball players...). So what I do is use the LookUp function of Mac OS (three fingers on a word to get the entry from either the dictionary or Wikipedia). It helps and performs well; for example, I just found a topic that is a list of departments in Burkina Faso... So I thought maybe we could use something similar (not Apple's LookUp itself): if, for each word in a topic, we get the abstract of its Wikipedia article (or Wiktionary?) and then compute the most frequent words (excluding stopwords), I think we would get a better description of a cluster than we have with WordNet and the isas. And then we could also evaluate cluster quality by computing the average semantic similarity between the abstracts. The main advantage is that it would cover many more words than WordNet and the isas do.
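A minimal sketch of that idea, with hypothetical names and assuming the Wikipedia abstracts for a cluster's words have already been fetched (the retrieval itself is left out):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: label a cluster by the most frequent non-stopword terms
// occurring in the Wikipedia abstracts of its words (abstracts assumed already fetched).
public class AbstractBasedLabel {

    private static final Set<String> STOPWORDS =
            Set.of("the", "a", "an", "of", "in", "and", "is", "are", "to", "or");

    public static List<String> topTerms(List<String> abstracts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : abstracts) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || STOPWORDS.contains(token)) continue;
                counts.merge(token, 1, Integer::sum);   // count non-stopword tokens
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<String> abstracts = Arrays.asList(
                "Boulsa is a town and department in Burkina Faso.",
                "Kaya is a town and department in the Centre-Nord region of Burkina Faso.");
        System.out.println(topTerms(abstracts, 5)); // e.g. [burkina, faso, department, town, ...]
    }
}
```
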

alexanderpanchenko commented 8 years ago

I am tagging the clusters, and for many of them I can't tell whether they are interpretable or not (because I don't know the topic well enough, e.g. baseball players...). So what I do is use the LookUp function of Mac OS (three fingers on a word to get the entry from either the dictionary or Wikipedia).

Precisely. That is why i suggest implementing a simple visualization interface. it would actually help to annotate the clusters. each node would refer to a google search (wikipedia doesn't contain much of the rare stuff)

So I thought maybe we could use something similar (not Apple's LookUp itself):

Agree. That was part of the plan in one way or another :-) My current idea is to visualize the cluster with a graph. Each word has a picture and is clickable (you can see the definition and/or go to the google results).

Specification for this visualization is still ToDo: please focus on annotation first and try to complete it asap, so you can focus on HTML + JavaScript stuff later this month.

smndf commented 8 years ago

ah ok :+1:

smndf commented 8 years ago

By the way, since the beginning I have been filtering the ddt files to keep nouns, adjectives and verbs. I tried with nouns only (NN NP) and it seems to give better results. See the CWddt-adagramNounsOnlyFiltBef2FiltAftWithHyp.csv file in the google doc or on Frink in /structured-topics/data.

alexanderpanchenko commented 8 years ago

you mean better in the sense of clustering? like less noisy clusters?

smndf commented 8 years ago

yes, maybe because removing verbs and adjectives removes links and that helps to better cluster the graph

smndf commented 8 years ago

For the news and wiki files, there are more clusters than for adagram: 500, 900, 6000... Should I tag only part of them?

alexanderpanchenko commented 8 years ago

can you post here the precise correspondence b/w the number of clusters and the model?

LM + adagram LM + ddt-wiki LM + ddt-news CW + adagram CW + ddt-wiki CW + ddt-news

smndf commented 8 years ago

Number of clusters per model:

LM+adagram 150
LM+news50 950
LM+news200 550
LM+wiki30 10050
LM+wiki200 850

CW+adagram 100
CW+news50 850
CW+news200 450
CW+wiki30 6000
CW+wiki200 350

alexanderpanchenko commented 8 years ago

annotate these fully (do in the first place): LM+adagram (150), LM+news200 (550), LM+wiki200 (850), CW+adagram (100), CW+news200 (450), CW+wiki200 (350)

from these, just pick like 50 good clusters (do in the second place): LM+news50 (950), CW+news50 (850), CW+wiki30 (6000), LM+wiki30 (10050)

smndf commented 8 years ago

Ok, so far I annotated LM+adagram, CW+adagram, and a second CW+adagram with nouns only. The clusters seem better with nouns only, so for the others should I filter to keep only nouns?

alexanderpanchenko commented 8 years ago

yes, for this experiment you can keep nouns only. but please be consistent. either annotate only with nouns or everything

smndf commented 8 years ago

I finished the first 6 (the ones to do in the first place).

alexanderpanchenko commented 8 years ago

please share the results

smndf commented 8 years ago

They are on the Google doc; the column is just before the clusters column. The first sheet can be used to try a new formula in order to find a good one (with a large area under the precision curve). I tried to add a script, but it's not working yet.

alexanderpanchenko commented 8 years ago

to make it cleaner for the presentation, can you please create a separate spreadsheet with only these 6 results?

please arrange the results like this (possibly adding extra information in the name):

LM + adagram LM + ddt-wiki LM + ddt-news CW + adagram CW + ddt-wiki CW + ddt-news

smndf commented 8 years ago

All the results are on the google doc (both "to do in the first place" and "in the second place"), with the precision curves.

alexanderpanchenko commented 8 years ago

1) add X axis to the plots

2) i added comments to google doc -- please revise

3) add an extra column "Keyword" and, for each cluster with Interpretable == 1, write to this column a keyword that characterizes the cluster, e.g. "names", "surnames" or "drugs". this will let us better interpret the results and understand how these clusters can be used in nlp applications.

smndf commented 8 years ago

I have done 1) and 2). I will do 3), but you should have told me to do it before; I will have to look at all the clusters one by one, which I've already done once :-/

alexanderpanchenko commented 8 years ago

regarding 3), sorry about it. i think it would help us to understand what kind of data we are dealing with. how much time will it take you?

smndf commented 8 years ago

I don't know, several days of full time work I guess. But the thing is also that it is quite repetitive and boring as there are about 3000 clusters to annotate.

alexanderpanchenko commented 8 years ago

OK. If it takes you several days just drop this one.

Can you at least annotate one model (like 300 clusters)? Maybe it will take you one hour or so, right?

smndf commented 8 years ago

yes ok

alexanderpanchenko commented 8 years ago

I would like to cancel the visualization task (before Christmas) that we discussed earlier. Instead, it would be really great if you could add to all of your models two extra columns with new hypernyms:

These hypernyms are, in my experience, much better than those I provided you before, so the labeling of the clusters shall look much better.

For each cluster, find all hypernyms and save the 50 most frequent hypernyms from these files. Please try two weighting strategies for each file:

  1. number of hypernyms: one cluster-word--hypernym pair adds 1 to the score of the cluster hypernym
  2. weight of hypernyms: one cluster-word--hypernym pair adds its weight to the score of the cluster hypernym.

I assume that you already have a program that does this, so it shall be easy. Note that the second file is really huge, so you can only do it on the server.
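For illustration only, a minimal sketch of the two weighting strategies (all names here are hypothetical; loading the clusters and looking up the hypernym files is assumed to happen elsewhere):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two weighting strategies for cluster hypernyms.
// For each cluster word, the isas map is assumed to hold its (hypernym, weight)
// pairs taken from one of the two hypernym files; the real lookup lives elsewhere.
public class ClusterHypernyms {

    record Isa(String hypernym, double weight) {}

    // Strategy 1 (useWeights = false): each cluster-word--hypernym pair adds 1.
    // Strategy 2 (useWeights = true): each pair adds its weight instead.
    public static Map<String, Double> scoreHypernyms(List<String> clusterWords,
                                                     Map<String, List<Isa>> isas,
                                                     boolean useWeights) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : clusterWords) {
            for (Isa isa : isas.getOrDefault(word, List.of())) {
                scores.merge(isa.hypernym(), useWeights ? isa.weight() : 1.0, Double::sum);
            }
        }
        return scores; // keep e.g. the 50 highest-scoring hypernyms per cluster
    }
}
```
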

The final result of this task would be 4 extra columns (2 models * 2 weighting strategies) in each of the 6 clusterings.

Can you release the 6 clusterings listed in your table with these four additional columns by the 2nd of January (the earlier the better)?

smndf commented 8 years ago

ok, just one remark: I took a look at the files and often there are several words instead of just one, e.g. "richard wright writer 3". Should I split them and check each word for a match in the cluster? i.e. maybe modify the files to generate several lines of one word instead of one line of several words: "richard writer 3" and "wright writer 3". It will induce nonsense lines, e.g. "catalonian meteorological service institution 1" would give "catalonian institution 1", which is not really relevant, but if I keep the multiple words on a line we will miss some isas relations

alexanderpanchenko commented 8 years ago

just one remark: I took a look at the files and often there are several words instead of just one, e.g. "richard wright writer 3"

no, just check single words

but if I keep the multiple words on a line we will miss some isas relations

no problem

smndf commented 8 years ago

Hello! Happy new year! And sorry for no update... since last year :-)

First of all, I am back in Darmstadt; is the meeting tomorrow still on?

Then, regarding the task you gave me: I uploaded some results on the google doc for the patternsim-isas file and will do the same for the commoncrawl-isas file as well today.

For patternsim, it is not complete (I have results for the 6 clusterings, but for each of them some values are missing). When I ran the programs on Frink, they all stopped at the same time, I don't know why, and I preferred to run the program with the commoncrawl file instead of rerunning patternsim.

For commoncrawl, the results are not complete either; that's because the programs for the 6 clusterings are still running on Frink. It is actually very slow, because for each word the whole file is read line by line; I thought about sorting the lines alphabetically and then indexing the newline byte offsets in a map, but I am not sure it would be a great gain of time, what do you think?

Well, even if the results are not complete, I guess they still provide some insight about isas relevance. I think maybe 50 is too much and in most cases taking 10 best isas would be sufficient. Between the patternsim and commoncrawl runs, I added the number of occurrences to the output, but in both cases the isas are NOT sorted in the output (like most frequent isas -> least frequent) :-/. (For the weight strategy, the number of occurrences is the sum of weights.)

alexanderpanchenko commented 8 years ago

Happy New Year!

I would prefer to move our meeting by one week. I have a very strict deadline on Friday and am trying to dedicate as much time as possible to it (the same time, the same place, but one week later).

Meanwhile, please send me:

all this is needed for your meeting with the professor, i will fix the date tomorrow

alexanderpanchenko commented 8 years ago

Well, even if the results are not complete, I guess they still provide some insight about isas relevance. I think maybe 50 is too much and in most cases taking 10 best isas would be sufficient.

agree, especially if this makes things faster.

alexanderpanchenko commented 8 years ago

It is actually very slow, because for each word the whole file is read line by line; I thought about sorting the lines alphabetically and then indexing the newline byte offsets in a map, but I am not sure it would be a great gain of time, what do you think?

how i would do it is to index/store in a hash table or smth like this
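For illustration, a minimal sketch of that idea, assuming (hypothetically) that the isas file has one word, hypernym and frequency per line, separated by tabs, and that it fits in memory; for the huge commoncrawl file this may still require splitting the file first:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: read the isas file once into a HashMap so that looking up
// the hypernyms of a cluster word no longer rescans the whole file line by line.
public class IsasIndex {

    private final Map<String, List<String[]>> index = new HashMap<>();

    public IsasIndex(String path) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length < 3) continue;                 // skip malformed lines
                index.computeIfAbsent(parts[0], k -> new ArrayList<>())
                     .add(new String[] { parts[1], parts[2] }); // hypernym, frequency
            }
        }
    }

    /** Returns the (hypernym, frequency) pairs for a word, or an empty list. */
    public List<String[]> hypernymsOf(String word) {
        return index.getOrDefault(word, List.of());
    }
}
```
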

smndf commented 8 years ago

It is actually very slow, because for each word the whole file is read line by line; I thought about sorting the lines alphabetically and then indexing the newline byte offsets in a map, but I am not sure it would be a great gain of time, what do you think?

how i would do it is to index/store in a hash table or smth like this

actually I found an easy way to do it this afternoon: sort the file and simply split it to generate one file per letter; it should be like 20 times faster
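A rough sketch of that split (hypothetical naming: each line of the sorted isas file is appended to a smaller file named after its first character, which is then the only file that needs to be scanned for a given word):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: split a (sorted) isas file into one file per starting letter.
public class SplitByFirstLetter {
    public static void main(String[] args) throws IOException {
        Map<Character, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                char first = Character.toLowerCase(line.charAt(0));
                BufferedWriter out = writers.computeIfAbsent(first, c -> {
                    try {   // e.g. isas.csv -> isas.csv.a, isas.csv.b, ...
                        return Files.newBufferedWriter(Paths.get(args[0] + "." + c));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                out.write(line);
                out.newLine();
            }
        }
        for (BufferedWriter out : writers.values()) out.close();
    }
}
```
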

alexanderpanchenko commented 8 years ago

great. please try now to prepare all materials by the next meeting, as you will need to present them to the professor (everything shall be ready). we will make a demo talk. put the numbers in front, the main results based on your evaluation, and add a description of the methods as well

smndf commented 8 years ago

ah ok, so the meeting on tuesday next week is only the two of us, and another meeting with Prof. Biemann after that? yes, I will work on this.

alexanderpanchenko commented 8 years ago

next tuesday it is only the two of us.

smndf commented 8 years ago

add description of the methods as well

you mean the java code?

alexanderpanchenko commented 8 years ago

add description of the methods as well

yes, just a short description in the presentation of the overall process: clustering, ranking clusters, assigning hypernyms

alexanderpanchenko commented 8 years ago

Just to confirm: tomorrow at 17:00 we meet in my office.

smndf commented 8 years ago

ok