sunandap / airhead-research

Automatically exported from code.google.com/p/airhead-research

tree-like clustering #85

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
from issue http://code.google.com/p/airhead-research/issues/detail?id=55

WordComparator is what you get when you use "get-neighbors ..." from the 
Semantic Space interpreter. 
Is it possible to use a tree-like clustering algorithm (like the WEKA 
hierarchical clustering package) on the semantic space?

Original issue reported on code.google.com by alain.dh...@gmail.com on 7 Mar 2011 at 1:43

GoogleCodeExporter commented 9 years ago
Hi,

We currently have support for hierarchical clustering (like WEKA does) within 
the S-Space Package, so clustering the neighbors should be possible.  However, 
I don't have a clear sense of what kind of output you're looking for.  Could 
you describe a bit more what information you're hoping to see and if possible, 
sketch a toy example of the output you'd like to see?  I think we can implement 
this, but I'd like to make sure we're implementing what you're asking for.

Original comment by David.Ju...@gmail.com on 7 Mar 2011 at 6:13

GoogleCodeExporter commented 9 years ago
The idea is to find "concepts" through clusters of words that are close in a 
semantic sense (it is not a new idea, and maybe not a very good one, but I'd 
like to see the results!).
Hierarchical clustering seems to be a simple solution because you already 
have a function to get the nearest neighbor of a word.
Because hierarchical clustering produces a tree, the algorithm has to cut the 
tree at a certain level, using a specific metric, to identify the more 
interesting agglomerative clusters.

Ex (very simple):

Words: Jules, Verne, french, author, restaurant, tower, Effeil

           |--------------/cut/---------------|
     |------------|                |------------------|
  |-----|       |------|         |-----|              |
french author  Jules Verne     Effeil tower       restaurant

Concepts are : "Jules Verne french author", "restaurant at Effeil tower (le 
Jules Verne)"
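For reference, the "cut at a certain level" step above can be sketched with an off-the-shelf library. This is an illustrative stand-in (SciPy, not the S-Space API), and the 2-D "semantic" vectors are invented so the toy example clusters the way the drawing shows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy words from the example; the vectors are made up for illustration.
words = ["french", "author", "Jules", "Verne", "Effeil", "tower", "restaurant"]
vectors = np.array([
    [0.0, 0.1], [0.1, 0.0],   # french, author
    [0.9, 1.0], [1.0, 0.9],   # Jules, Verne
    [5.0, 5.1], [5.1, 5.0],   # Effeil, tower
    [6.5, 6.5],               # restaurant
])

# Build the dendrogram with average linkage, then cut it into two flat
# clusters -- the /cut/ line in the sketch above.
Z = linkage(vectors, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
for cluster_id in sorted(set(labels)):
    print(cluster_id, [w for w, l in zip(words, labels) if l == cluster_id])
```

Here `fcluster`'s `maxclust` criterion stands in for the cut; a distance threshold (`criterion="distance"`) is the other common way to decide where to cut.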

Original comment by alain.dh...@gmail.com on 8 Mar 2011 at 9:41

GoogleCodeExporter commented 9 years ago
Ok, I think I see what you're getting at.  So you would like to have the 
k-nearest-neighbors clustered and then show the dendrogram for those neighbors?

Cutting the dendrogram is a bit trickier since there's a *large* number of ways 
to do this, e.g. min-similarity, number of merges, edge similarities, etc.  I 
think it may be easiest to just report the full dendrogram and somehow encode 
the similarities in the output so the user can see what the data looks like. 

So would something like this work:

> cluster-neighbors Jules 6
            /- french
     (.71) /\- author
          /
   (.60) /\- Verne
        /
Jules - \
         \/- restaurant
    (.42) \
           \/- Eiffel
      (.83) \- tower

I think the vertical alignment might be easier to reproduce mechanically and 
would also let us fit more neighbors on a screen (imagine doing 30 neighbors 
horizontally :)  )

Does this match what you had in mind?
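A rough sketch of how such a vertical layout could be produced mechanically (again a SciPy stand-in, not S-Space code; the vectors and the `render` helper are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Invented vectors for the six neighbors, laid out so the cosine
# similarities roughly resemble the example.
words = ["french", "author", "Verne", "restaurant", "Eiffel", "tower"]
vectors = np.array([
    [1.0, 0.1], [1.0, 0.2], [0.9, 0.4],   # french, author, Verne
    [0.2, 1.0], [0.1, 1.0], [0.15, 1.0],  # restaurant, Eiffel, tower
])

def render(node, labels, depth=0):
    """Indented, vertical dendrogram: one line per leaf or merge."""
    indent = "  " * depth
    if node.is_leaf():
        return [indent + "- " + labels[node.id]]
    # node.dist is a cosine *distance*; 1 - dist is the merge similarity.
    lines = [indent + "({:.2f})".format(1.0 - node.dist)]
    return (lines
            + render(node.left, labels, depth + 1)
            + render(node.right, labels, depth + 1))

Z = linkage(vectors, method="average", metric="cosine")
print("\n".join(render(to_tree(Z), words)))
```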

Original comment by David.Ju...@gmail.com on 8 Mar 2011 at 7:16

GoogleCodeExporter commented 9 years ago
Well, if you can also draw ASCII art ;-)

I'm not sure that we have the same idea: you are talking about a k-NN 
algorithm and then printing the result using a tree (2 branches).

You have to specify one word and the number of clusters:

>knn Jules 6

  ------------------------------------------------                    
|                       Jules                     |
|  -------------------     ---------------------- |
| |         Verne     |   |       restaurant     || 
| |   --------------- |   |     ---------------  ||
| |  | french author ||   |    | Effeil tower  | ||
| |   --------------- |   |     ---------------  || 
|  -------------------     ---------------------- |
 -------------------------------------------------

Is that correct?

In my mind I would use the dendrogram directly, that is, I search for each 
word's nearest neighbor and then iterate on the clusters.

I don't specify a word or a number of clusters:

>dendrogram

1 iteration (Jules-Verne) (French author) (Effeil tower) (Restaurant)
2 iteration ((Jules-Verne)-(French author)) ((Effeil tower)-(restaurant))
3 iteration (((Jules-Verne)-(French author)) ((Effeil tower)-(restaurant)))

For the first iteration we simply use your neighbor function between two 
words, but for the next iterations we have to compute a distance between two 
clusters of words ...
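The iteration described above can be written down naively as follows (an illustrative sketch: the vectors are invented, and average linkage is just one of several possible cluster-to-cluster distances):

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerate(vecs):
    """Merge the two closest clusters until one remains; return the
    partition after each iteration."""
    clusters = [(w,) for w in vecs]   # every word starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean pairwise distance between clusters
                d = np.mean([cosine_dist(vecs[a], vecs[b])
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
        history.append(list(clusters))
    return history

# Invented unit-ish vectors for the running example.
vecs = {"Jules": np.array([0.940, 0.342]), "Verne": np.array([0.906, 0.423]),
        "french": np.array([1.0, 0.0]), "author": np.array([0.996, 0.087]),
        "Effeil": np.array([0.342, 0.940]), "tower": np.array([0.259, 0.966]),
        "restaurant": np.array([0.0, 1.0])}
for n, partition in enumerate(agglomerate(vecs), 1):
    print(n, "iteration:", partition)
```

The first iterations only compare single words (the plain neighbor function), while later iterations fall back on the average-linkage distance between whole clusters, which is exactly the extra computation the comment points out.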

Using your representation:

                  /-Jules
                 /\-Verne
             /  /
            /\- \/-French
           /     \-Author
          /
         /  
         \
          \   
           \    /-Effeil
            \/-/\-tower
             \ \
                \/-restaurant
                 \- 

Original comment by alain.dh...@gmail.com on 9 Mar 2011 at 10:32

GoogleCodeExporter commented 9 years ago
Maybe, to extract only a part of the tree, we can use parameters:

>dendrogram 3 jules Verne

Which means:
 Get the part of the tree from leaves "jules" and "Verne", using 3 levels of node ancestors

or 
>dendrogram jules Verne restaurant

Which means:
 Get the part of the tree containing leaves "jules", "Verne" and "restaurant"
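The "subtree containing given leaves" variant could be sketched like this (again a SciPy stand-in with invented vectors; `smallest_subtree` is a hypothetical helper, not an existing command):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def leaves(node, labels):
    """Words under a dendrogram node, in pre-order."""
    return [labels[i] for i in node.pre_order()]

def smallest_subtree(node, labels, wanted):
    """Deepest node whose leaf set still contains every requested word."""
    if not wanted.issubset(set(leaves(node, labels))):
        return None
    for child in (node.left, node.right):
        if child is not None:
            sub = smallest_subtree(child, labels, wanted)
            if sub is not None:
                return sub
    return node

# Invented toy vectors for the running example.
words = ["jules", "Verne", "french", "author", "Effeil", "tower", "restaurant"]
vectors = np.array([
    [0.9, 1.0], [1.0, 0.9],   # jules, Verne
    [0.0, 0.1], [0.1, 0.0],   # french, author
    [5.0, 5.1], [5.1, 5.0],   # Effeil, tower
    [6.5, 6.5],               # restaurant
])
root = to_tree(linkage(vectors, method="average"))
print(leaves(smallest_subtree(root, words, {"jules", "Verne"}), words))
```

Requesting leaves that sit on opposite sides of the tree (e.g. "jules" and "restaurant") simply climbs back up to the root, so the whole tree is returned in that case.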

And then, if we click on one node, we can see all relevant documents, etc. 

Original comment by alain.dh...@gmail.com on 9 Mar 2011 at 10:58

GoogleCodeExporter commented 9 years ago
Oops, you are right, it is "Eiffel" and not "Effeil"! (sorry, Gustave ...)

Original comment by alain.dh...@gmail.com on 9 Mar 2011 at 1:37