Select clusters containing more than a certain number of segments

gabrielparriaux commented 10 months ago

Hi @juba,

After having done a Rainette clustering, I often execute a Correspondence Analysis with lexicon and clusters.

In that case, very small clusters tend to pull the plot to the extremes, making it difficult to read.

So I’m looking for a way to select and isolate the clusters that contain a very small number of segments.

In other words, I need to build a vector with the names of the clusters that contain fewer than a certain number of segments.

I have looked at the available documentation but can't see how to do it.

Can you help me and point me in the right direction?

Thanks a lot for your help!

Gabriel

juba commented 10 months ago

This is not directly related to rainette. You have to compute the size of the clusters and filter out the smaller ones. Something like:

tab <- table(clusters)        # number of segments in each cluster
names(tab)[tab > min_size]    # names of the clusters larger than min_size
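
For illustration, here is a minimal base R sketch of the same idea on a toy vector. The values of `clusters` and the `min_size` threshold are made up for the example; in practice `clusters` would hold one cluster label per segment. It builds both the vector of large clusters and the vector of small ones asked for above.

```r
# Toy cluster assignment, one label per segment (hypothetical data)
clusters <- c(1, 1, 1, 2, 3, 3, 3, 3, 4, 1)

# Hypothetical threshold: clusters with fewer segments count as "small"
min_size <- 3

# Number of segments in each cluster
tab <- table(clusters)

# Names of the clusters at or above the threshold, and of those below it
large_clusters <- names(tab)[tab >= min_size]
small_clusters <- names(tab)[tab < min_size]

print(large_clusters)
print(small_clusters)
```

Note that `names(tab)` returns character strings, so the labels have to be compared with `as.character(clusters)` when filtering later on.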

gabrielparriaux commented 10 months ago

Thanks a lot for your help and sorry to ask a question not directly related to rainette… 😰

Just a question: is there an object “clusters” that I can use to compute the size of each cluster? I can’t find it in the docs.

I had an idea of computing the size of each cluster with something like this:

clusters <- clusters_by_doc_table(dtm_for_analysis, clust_var = "Cluster")
sum(clusters$clust_1)
…

But then I would need a loop to do it for each cluster in the clustering… and I suspect there is something simpler?

Sorry if it’s an obvious question…

juba commented 10 months ago

If you're looking for the size of each cluster in terms of number of segments, then doing the following should be enough:

clusters <- cutree(res, k = 5)    # cluster membership of each segment
table(clusters)                   # number of segments per cluster

Or in your example:

table(dtm_for_analysis$Cluster)
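
Once the sizes are known, the segments belonging to small clusters can be dropped before running the correspondence analysis. The sketch below uses a plain data frame named `segments` as a stand-in for the real data; that name, the toy values and the `min_size` threshold are assumptions made for the example.

```r
# Stand-in for segment-level data (hypothetical); with rainette the Cluster
# column would come from cutree() or from a docvar such as
# dtm_for_analysis$Cluster
segments <- data.frame(
  doc_id  = paste0("seg", 1:8),
  Cluster = c(1, 1, 1, 2, 3, 3, 3, 1)
)

# Hypothetical minimum number of segments for a cluster to be kept
min_size <- 3

# Cluster sizes, then names of the clusters that are large enough
tab  <- table(segments$Cluster)
keep <- names(tab)[tab >= min_size]

# Keep only the segments that belong to a large enough cluster
segments_kept <- segments[as.character(segments$Cluster) %in% keep, ]

print(table(segments_kept$Cluster))
```

With a quanteda dfm, the same logical condition could presumably be passed to `dfm_subset()` rather than used for row indexing.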

gabrielparriaux commented 10 months ago

Much easier like this 😬. Thanks a lot for helping, this is exactly what I needed!

fredericln commented 3 weeks ago

Hello, thanks @juba for the clarifications in this thread and other ones.

I just tried rainette for a few hours on 2 corpora (one from social networks, on health problems; the other from an RPS survey in a large company), and it's a pleasure to use, both because it secures the ability to use the Reinert method (Iramuteq updates have been long awaited, excuse me Pierre R.!) and because rainette outputs are normal, reusable R objects.

Yet @gabrielparriaux's and others' point makes sense to me. On both corpora, around 2/3 of the classes were very tiny. I could, for sure, filter out these outlier classes afterwards, but as far as I understand, a) I cannot use rainette_explor on the filtered list (so 2/3 of the explor screen contents are garbage), and b) it would mean renumbering the classes, so the correspondence with the original numbering is lost.

Imho, it would be great:

either to allow the user to set a minimal cluster size (say, with all clusters below that limit grouped into a class 0, Iramuteq-style) - in that case the user cannot set the number k of clusters, only an upper bound;
or to allow the user, e.g. within rainette_explor to "throw to the bin" (i.e. class 0) some clusters, and redraw the contents accordingly.

Just my two cents; I am (well, I became, with age) an absolute layman in software development, and other users may not see this as an issue at all.
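
Until something like this exists in rainette, the "class 0" idea can be approximated after the fact in base R. The following is only a sketch on toy data, with a hypothetical `min_size` threshold: it recodes every small cluster to 0 and leaves the other labels untouched, so no renumbering is needed. It does not change what rainette_explor displays.

```r
# Toy cluster assignment, one label per segment (hypothetical data)
clusters <- c(1, 1, 1, 2, 3, 3, 3, 3, 4, 1)

# Hypothetical threshold below which a cluster is sent to class 0
min_size <- 3

# Names of the clusters that are too small
tab   <- table(clusters)
small <- names(tab)[tab < min_size]

# Recode small clusters to 0; all other labels keep their original number
clusters_grouped <- ifelse(as.character(clusters) %in% small, 0, clusters)

print(table(clusters_grouped))
```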

juba commented 2 weeks ago

Hi,

Thanks for your observations, I think they are totally legit. Unfortunately I don't work on textual analysis anymore, so rainette is currently in low-maintenance mode. But I'll definitely try to implement your suggestions if I find some time in the future...

juba / rainette

Select clusters containing more than a certain number of segments #31