Ok, I guess it will give a better overview with these 6 best configurations than with 2 or 3. Regarding step 2, I understood we would first select the best clusters according to this ranking before deciding which ones are interpretable. But actually it shouldn't add much extra work since, once sorted, most bad clusters will likely get a 0. So I think it's fine; I will do this as soon as possible.
> Ok, I guess it will give a better overview with these 6 best configurations than with 2 or 3.
right, and each is about 100-300 clusters, so it shouldn't be too much work. also, this indeed gives a better idea about the different parameters. you can take any wiki/news ddt here.
> Regarding step 2, I understood we would first select the best clusters according to this ranking before deciding which ones are interpretable. But actually it shouldn't add much extra work since, once sorted, most bad clusters will likely get a 0.
right, ranking helps, but please mind that it is very important to inspect each cluster to get correct numbers
I couldn't work before today, so so far I have processed two files. I am doing the third one, but I just noticed that some words have their first letter missing. I checked, and they appear that way in the original file as well (ddt-adagram-ukwac+wacky-476k-closure-v3.csv).
Like this: 311844:ontrol#NOUN 1 0.997 dvanced#VERB#1:0.861742,omputer#NOUN#1:0.860782,ystem#NOUN#1:0.857831,anguage#NOUN#1:0.85202,ight#NOUN#1:0.835521,echnology#NOUN#1:0.833527,igital#ADJ#1:0.826269,perating#VERB#1:0.825774,esearch#NOUN#1:0.821001...
When I checked the original file, I also noticed some words with an 'a' added at the end: 475453:zonea#NOUN 1 0.999 grounda#NOUN#1:0.623979,flighta#NOUN#1:0.613561,statea#NOUN#1:0.59931,actiona#NOUN#1:0.590598,conflicta#N...
but there seem to be no words with a 'b' added, for instance
thanks for pointing this out. just proceed despite these errors. check also whether the versions with the first letter are present, e.g. "control"
Yes, the normal words are there as well (and they occur more often).
I am tagging the clusters, and for most of them I can't tell whether they are interpretable or not (because I don't know the topic well enough, e.g. baseball players...). So what I do is use the Look Up feature of macOS (three-finger tap on a word to get the entry from either the dictionary or Wikipedia). It helps and performs well; for example, I just found a topic that is a list of departments in Burkina Faso. So I thought maybe we could use something similar (not Apple's Look Up itself): if for each word in a topic we get the abstract of its Wikipedia (or Wiktionary?) article and then compute the most frequent words (excluding stopwords), I think we would get a better description of a cluster than we have with WordNet and the isas. And then we could also evaluate cluster quality by computing the average semantic similarity between the abstracts. The main advantage is that it would cover many more words than WordNet and the isas do.
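To make the idea concrete, here is a minimal sketch of the labeling step, assuming the abstracts have already been fetched into a collection (the fetching part, the class name and the stopword list are placeholders, not existing code):

```java
import java.util.*;
import java.util.stream.*;

public class ClusterLabeler {

    // toy stopword list for illustration; a real one would be loaded from a file
    private static final Set<String> STOPWORDS =
            Set.of("the", "a", "an", "of", "and", "in", "is", "to", "for");

    /**
     * Given one Wikipedia/Wiktionary abstract per cluster word, return the n most
     * frequent non-stopword tokens as a rough description of the cluster.
     */
    public static List<String> label(Collection<String> abstracts, int n) {
        Map<String, Long> counts = abstracts.stream()
                .flatMap(a -> Arrays.stream(a.toLowerCase().split("\\W+")))
                .filter(t -> t.length() > 2 && !STOPWORDS.contains(t))
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```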
> I am tagging the clusters, and for most of them I can't tell whether they are interpretable or not (because I don't know the topic well enough, e.g. baseball players...). So what I do is use the Look Up feature of macOS (three-finger tap on a word to get the entry from either the dictionary or Wikipedia).
Precisely. That is why I suggest implementing a simple visualization interface; it would actually help to annotate the clusters. Each node would refer to a Google search (Wikipedia doesn't contain much of the rare stuff).
> So I thought maybe we could use something similar (not Apple's Look Up itself):
Agree. That was part of the plan in one way or another :-) My current idea is to visualize the cluster as a graph. Each word has a picture and is clickable (you can see the definition and/or go to the Google results).
The specification for this visualization is still a ToDo. Please focus on the annotation first and try to complete it asap, so you can focus on the HTML + JavaScript stuff later this month.
ah ok :+1:
By the way, since the beginning I have been filtering the ddt files to keep nouns, adjectives and verbs. I tried with nouns only (NN, NP) and it seems to give better results. See the CWddt-adagramNounsOnlyFiltBef2FiltAftWithHyp.csv file in the Google doc or on Frink under /structured-topics/data.
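For reference, the filtering itself is essentially just dropping entries by POS tag. A minimal sketch, assuming tab-separated ddt lines with the target word in the first column in the `word#POS` format shown above (the file names and class name are placeholders):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.*;
import java.util.stream.Stream;

public class NounOnlyFilter {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get("ddt-input.csv"));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("ddt-nouns-only.csv")))) {
            lines
                // keep only entries whose target word is tagged as a noun, e.g. "control#NOUN"
                .filter(l -> l.split("\t")[0].endsWith("#NOUN"))
                .forEach(out::println);
        }
    }
}
```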
you mean better in the sense of clustering? like less noisy clusters?
yes, maybe because removing verbs and adjectives removes links and that helps to better cluster the graph
For the news and wiki files, there are more clusters than for adagram: 500, 900, 6000... Should I tag only part of them?
can you post here the precise correspondence between the number of clusters and the model?
LM + adagram, LM + ddt-wiki, LM + ddt-news, CW + adagram, CW + ddt-wiki, CW + ddt-news
| Model | Clusters |
| --- | --- |
| LM+adagram | 150 |
| LM+news50 | 950 |
| LM+news200 | 550 |
| LM+wiki30 | 10050 |
| LM+wiki200 | 850 |
| CW+adagram | 100 |
| CW+news50 | 850 |
| CW+news200 | 450 |
| CW+wiki30 | 6000 |
| CW+wiki200 | 350 |
annotate these fully (do them first): LM+adagram 150, LM+news200 550, LM+wiki200 850, CW+adagram 100, CW+news200 450, CW+wiki200 350
from these, just pick around 50 good clusters (do them second): LM+news50 950, CW+news50 850, CW+wiki30 6000, LM+wiki30 10050
Ok, so far I annotated LM+adagram, CW+adagram and a second CW+adagram with nouns only. The clusters seem better with nouns only, so should I filter the others to keep only nouns?
yes, for this experiment you can keep nouns only. but please be consistent: either annotate everything with nouns only or everything unfiltered.
I finished the first 6 (the ones to do first).
please share the results
They are in the Google doc; the column is just before the clusters column. The first sheet can be used to try a new formula in order to find a good one (with a large area under the precision curve). I tried to add a script, but it's not working yet.
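For what it's worth, the quantity I have in mind is the area under the precision-at-k curve over the ranked clusters. A minimal sketch of one way to compute it (the class and method names are placeholders; it assumes the clusters are already sorted by the ranking score, each with a 0/1 interpretability label):

```java
import java.util.List;

public class PrecisionCurve {

    /**
     * Area under the precision-at-k curve for clusters already sorted by the ranking score.
     * labels.get(i) is 1 if the i-th cluster was annotated as interpretable, 0 otherwise.
     * Returns a value in [0, 1]; higher means the ranking puts good clusters first.
     */
    static double areaUnderPrecisionCurve(List<Integer> labels) {
        double area = 0.0;
        int positives = 0;
        for (int k = 0; k < labels.size(); k++) {
            positives += labels.get(k);
            area += positives / (double) (k + 1);   // precision at rank k+1
        }
        return area / labels.size();                 // average over all ranks
    }
}
```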
to make it cleaner for the presentation, can you please create a separate spreadsheet with only these 6 results?
please arrange the results like this (possibly adding extra information in the name):
LM + adagram, LM + ddt-wiki, LM + ddt-news, CW + adagram, CW + ddt-wiki, CW + ddt-news
All the results (both the first and the second batch) are in the Google doc, together with the precision curves.
1) add X axis to the plots
2) I added comments to the Google doc -- please revise them
3) add an extra column "Keyword" and, for each cluster with Interpretable == 1, write in this column a keyword that characterizes the cluster, e.g. "names", "surnames" or "drugs". this will let us better interpret the results and understand how these clusters can be used in NLP applications.
I have done 1) and 2). I will do 3), but you should have told me to do it earlier; I will have to look at all the clusters one by one again, which I've already done once :-/
regarding 3), sorry about it. I think it would help us to understand what kind of data we are dealing with. how much time will it take you?
I don't know, several days of full-time work I guess. But the thing is also that it is quite repetitive and boring, as there are about 3000 clusters to annotate.
OK. If it takes you several days just drop this one.
Can you at least annotate one model (around 300 clusters)? It will maybe take you one hour or so, right?
yes ok
I would like to cancel the visualization task (before Christmas) that we discussed earlier. Instead, it would be really great if you could add to all of your models two extra columns with new hypernyms:
These hypernyms are, in my experience, much better than those I provided you before, so the labeling of the clusters should look much better.
For each cluster, find all hypernyms and save the 50 most frequent hypernyms from these files. Please try two weighting strategies for each file:
I assume that you already have a program that does this, so it shall be easy. Note that the second file is really huge, so you can only do it on the server.
The final result of this task would be 4 extra columns (2 models * 2 weighting strategies) in each of the 6 clusterings.
Can you release the 6 clusterings listed in your table with these four additional columns by the 2nd of January (the earlier the better)?
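Not prescribing the implementation, but roughly the per-cluster aggregation I have in mind looks like the sketch below. It assumes an isas file with tab-separated lines of the form `word<TAB>hypernym<TAB>frequency`; the class and method names are placeholders, and the two weighting variants shown (summed frequency vs. number of cluster words covered) are only illustrative:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class HypernymAggregator {

    /** word -> (hypernym -> frequency), loaded from an isas file with lines "word\thypernym\tfreq". */
    static Map<String, Map<String, Long>> loadIsas(Path isasFile) throws IOException {
        Map<String, Map<String, Long>> isas = new HashMap<>();
        try (Stream<String> lines = Files.lines(isasFile)) {
            lines.map(l -> l.split("\t"))
                 .filter(f -> f.length >= 3)
                 .forEach(f -> isas.computeIfAbsent(f[0], k -> new HashMap<>())
                                   .merge(f[1], Long.parseLong(f[2]), Long::sum));
        }
        return isas;
    }

    /**
     * Top-n hypernyms for one cluster. Two illustrative weighting variants:
     * sum the frequencies from the isas file, or just count how many cluster words share the hypernym.
     */
    static List<String> topHypernyms(Collection<String> clusterWords,
                                     Map<String, Map<String, Long>> isas,
                                     boolean weightByFrequency, int n) {
        Map<String, Long> scores = new HashMap<>();
        for (String word : clusterWords) {
            for (Map.Entry<String, Long> e : isas.getOrDefault(word, Map.of()).entrySet()) {
                scores.merge(e.getKey(), weightByFrequency ? e.getValue() : 1L, Long::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```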
ok, just one remark: I took a look at the files, and often there are several words instead of just one, e.g. "richard wright writer 3". Should I split them and check each word for a match in the cluster? i.e. maybe modify the files to generate several lines of one word each instead of one line with several words: "richard writer 3" and "wright writer 3". It will induce nonsense lines, e.g. "catalonian meteorological service institution 1" would give "catalonian institution 1", which is not really relevant, but if I keep the multiple words on one line we will miss some isas relations.
> just one remark: I took a look at the files, and often there are several words instead of just one, e.g. "richard wright writer 3"
no, just check single words
> but if I keep the multiple words on one line we will miss some isas relations
no problem
Hello! Happy new year! And sorry for no update... since last year :-)
First of all, I am back in Darmstadt; is the meeting tomorrow still on?
Then, regarding the task you gave me: I uploaded some results to the Google doc for the patternsim-isas file and will do the same for the commoncrawl-isas file today as well.
For patternsim, it is not complete (I have results for the 6 clusterings, but for each of them some values are missing). When I ran the programs on Frink, they all stopped at the same time, I don't know why, and I preferred to run the program with the commoncrawl file instead of rerunning patternsim.
For commoncrawl, the results are not complete either; that's because the programs for the 6 clusterings are still running on Frink. It is actually very slow because for each word the whole file is read line by line. I thought about sorting the lines alphabetically and then indexing the newline byte offsets in a map, but I am not sure it would save much time, what do you think?
Well, even if the results are not complete, I guess they still provide some insight into the relevance of the isas. I think maybe 50 is too many, and in most cases taking the 10 best isas would be sufficient. Between the patternsim and commoncrawl runs, I added the number of occurrences to the output, but in both cases the isas are NOT sorted in the output (i.e. from most frequent to least frequent) :-/ (for the weight strategy, the number of occurrences is the sum of the weights).
Happy New Year!
I would prefer to move our meeting by one week. I have a very strict deadline on Friday and am trying to dedicate as much time as possible to it (the same time, the same place, but one week later).
Meanwhile, please send me:
all this is needed for your meeting with the professor; I will fix the date tomorrow
> Well, even if the results are not complete, I guess they still provide some insight into the relevance of the isas. I think maybe 50 is too many, and in most cases taking the 10 best isas would be sufficient.
agree, especially if this makes things faster
> It is actually very slow because for each word the whole file is read line by line. I thought about sorting the lines alphabetically and then indexing the newline byte offsets in a map, but I am not sure it would save much time, what do you think?
how I would do it is to index/store it in a hash table or something like this
actually, I found an easy way to do it this afternoon: sort the file and simply split it to generate one file per first letter; it should be something like 20 times faster
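A minimal sketch of that splitting step, assuming the sorted isas file is plain text with the target word at the start of each line (the file names and class name are placeholders):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class SplitByFirstLetter {
    public static void main(String[] args) throws IOException {
        Map<Character, PrintWriter> writers = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get("isas-sorted.csv"))) {
            for (String line : (Iterable<String>) lines::iterator) {
                if (line.isEmpty()) continue;
                char first = Character.toLowerCase(line.charAt(0));
                // one output file per first letter, e.g. isas-a.csv, isas-b.csv, ...
                PrintWriter w = writers.computeIfAbsent(first, c -> {
                    try {
                        return new PrintWriter(Files.newBufferedWriter(Paths.get("isas-" + c + ".csv")));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                w.println(line);
            }
        }
        writers.values().forEach(PrintWriter::close);
    }
}
```

With the words sorted, each lookup then only has to scan the much smaller per-letter file, which is where the speed-up should come from.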
great. please now try to prepare all the materials by the next meeting, as you need to present them to the professor (everything shall be ready). we will make a demo talk. put the numbers and the main results based on your evaluation up front, and add a description of the methods as well.
ah ok, so the meeting on Tuesday next week is only the two of us, and there will be another meeting with Prof. Biemann after that? yes, I will work on this.
next Tuesday it is only the two of us.
> add a description of the methods as well
you mean the Java code?
> add a description of the methods as well
yes, just a short description in the presentation of the overall process: clustering, ranking clusters, assigning hypernyms
Just to confirm: tomorrow at 17:00 we meet in my office.
ok
Motivation
Evaluate the first result so you are able to write the first report.
Implementation