seunginah / semanticvectors

Automatically exported from code.google.com/p/semanticvectors

Clustering vectors is missing option for restricting number of results #41

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
ClusterVectorStore does not have an option or flag for restricting the number of 
results, similar to -numsearchresults in:

java pitt.search.semanticvectors.ClusterResults -numclusters 2 -numsearchresults 30 pharaoh

For large datasets, it is tedious to read through clusters with hundreds of 
elements (documents or terms). It would be nice to see only a few top items 
for each cluster.

I have added such an option, and the modified files are attached. Please 
review them and, if they are acceptable, let me know whether I can commit the changes.

The new option is -clustersize. 
This is in connection with my question at: 
http://groups.google.com/group/semanticvectors/browse_thread/thread/c8e2096334a34962

However, I am not sure this is the right way to do it.

Original issue reported on code.google.com by pmj005@gmail.com on 5 Jul 2011 at 7:57


GoogleCodeExporter commented 9 years ago
So the idea is to use the flag -clustersize to limit the number of elements of 
each cluster that are printed to stdout? That makes plenty of sense.

I wonder if it could confuse people into thinking that this is a limit on the 
clusters themselves, rather than on the printed output. But I'm having trouble 
thinking of a clearer name that isn't very long-winded. I can't imagine anyone 
typing "-maxclusterelementsprinted" in a hurry! If you can think of a name 
somewhere in between, that might be ideal.

We could add a trailing print saying " ... and XX more." to give the reader a 
sense of how many elements are missing.

If you want developer permissions then you can add these modifications yourself 
and that would be great. Otherwise I'll gladly do it for you. Thanks for the 
initiative.

-Dominic

Original comment by widd...@google.com on 6 Jul 2011 at 1:45
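
The trailing " ... and XX more." line suggested in the comment above could look roughly like the following. This is only a minimal sketch: the class `ClusterPrinter`, the method `formatCluster`, and the parameter `maxPrinted` are hypothetical names invented for illustration and are not part of the SemanticVectors code base.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of SemanticVectors. */
class ClusterPrinter {

  /**
   * Formats at most maxPrinted elements of one cluster, followed by a
   * trailing line that says how many elements were left out.
   */
  static String formatCluster(int clusterId, List<String> elements, int maxPrinted) {
    StringBuilder out = new StringBuilder("Cluster " + clusterId + "\n");
    // Clamp the limit to the range [0, elements.size()].
    int shown = Math.min(Math.max(maxPrinted, 0), elements.size());
    for (int i = 0; i < shown; i++) {
      out.append(elements.get(i)).append("\n");
    }
    int omitted = elements.size() - shown;
    if (omitted > 0) {
      out.append(" ... and ").append(omitted).append(" more.\n");
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Made-up cluster contents, used only to show the output format.
    List<String> cluster = Arrays.asList("pharaoh", "egypt", "moses", "aaron", "plague");
    System.out.print(formatCluster(0, cluster, 2));
  }
}
```

With the made-up five-element cluster in `main` and a limit of 2, this prints the first two elements followed by " ... and 3 more.", so the reader can tell that the cluster itself is larger than what is shown.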

GoogleCodeExporter commented 9 years ago
Dominic, 
Yes, it would be nice if you could grant me developer permissions so that I can 
commit such changes directly. 
Yes, I have added an optional flag called -clustersize as a limit. Ideally, I 
wanted to set a maximum number of elements to be output for each cluster, so 
that only the best or most representative elements of each cluster, up to that 
maximum, would be printed. But I am not that familiar with the code, so what 
I have done is an intermediate step: the flag sets the maximum number of elements 
in all clusters combined. Depending on how the data is spread, some clusters may 
have few elements and others more, but the total will not exceed the -clustersize value. 
I also struggled a little with the choice of name, but thought this was the least 
bad option! 

Original comment by pmj005@gmail.com on 6 Jul 2011 at 7:31
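
The difference between the two behaviours described in the comment above (a cap on all clusters combined versus a cap on each cluster) can be sketched as follows. This is an illustrative assumption about how such limits might work, not the committed implementation: the class `ClusterCaps`, both methods, and the idea of a ranked item list with a parallel array of cluster assignments are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the code committed to SemanticVectors. */
class ClusterCaps {

  private static List<List<String>> emptyClusters(int numClusters) {
    List<List<String>> clusters = new ArrayList<>();
    for (int i = 0; i < numClusters; i++) {
      clusters.add(new ArrayList<>());
    }
    return clusters;
  }

  /** Total cap: keep only the first maxTotal ranked items, then group them by cluster. */
  static List<List<String>> groupWithTotalCap(
      String[] items, int[] clusterOf, int numClusters, int maxTotal) {
    List<List<String>> clusters = emptyClusters(numClusters);
    int limit = Math.min(maxTotal, items.length);
    for (int i = 0; i < limit; i++) {
      clusters.get(clusterOf[i]).add(items[i]);
    }
    return clusters;
  }

  /** Per-cluster cap: walk all ranked items, keeping at most maxPerCluster in each cluster. */
  static List<List<String>> groupWithPerClusterCap(
      String[] items, int[] clusterOf, int numClusters, int maxPerCluster) {
    List<List<String>> clusters = emptyClusters(numClusters);
    for (int i = 0; i < items.length; i++) {
      List<String> cluster = clusters.get(clusterOf[i]);
      if (cluster.size() < maxPerCluster) {
        cluster.add(items[i]);
      }
    }
    return clusters;
  }

  public static void main(String[] args) {
    // Made-up ranked items and cluster assignments.
    String[] items = {"a", "b", "c", "d", "e"};
    int[] clusterOf = {0, 0, 0, 1, 1};
    System.out.println(groupWithTotalCap(items, clusterOf, 2, 3));
    System.out.println(groupWithPerClusterCap(items, clusterOf, 2, 2));
  }
}
```

In this made-up example, the total cap of 3 leaves the second cluster empty because the first three ranked items all fall into the first cluster, whereas the per-cluster cap of 2 keeps two items in each cluster. That matches the observation that, with a combined limit, the number of elements per cluster depends on how the data is spread.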

GoogleCodeExporter commented 9 years ago
OK, this looks good and I've granted developer permissions.

-Dominic

Original comment by widd...@google.com on 6 Jul 2011 at 7:35

GoogleCodeExporter commented 9 years ago
Fixed and committed my code.

Original comment by pmj005@gmail.com on 11 Jul 2011 at 5:33
