juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Select clusters containing more than a certain number of segments #31

Open gabrielparriaux opened 10 months ago

gabrielparriaux commented 10 months ago

Hi @juba,

After having done a Rainette clustering, I often execute a Correspondence Analysis with lexicon and clusters.

In that case, very small clusters tend to pull the plot to the extremes, making it difficult to read.

So I’m looking for a way to select and isolate the clusters that contain a very small number of segments.

In other words, I need to build a vector with the names of the clusters that contain fewer than a certain number of segments.

I have looked at the documentation available but have no idea of how to do it.

Can you help me and put me on the way?

Thanks a lot for your help!

Gabriel

juba commented 10 months ago

This is not directly related to rainette. You have to compute the size of the clusters and filter out the smaller ones. Something like:

tab <- table(clusters)
names(tab)[tab > min_size]
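As a minimal, self-contained sketch of that idea: the `clusters` vector and the `min_size` threshold below are made-up toy values, not objects provided by rainette. The same table gives both the clusters to keep and the small ones to set aside.

```r
# Toy cluster assignments, one cluster id per segment
# (hypothetical data standing in for the result of a clustering).
clusters <- c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5)

# Hypothetical threshold: a cluster is kept only if it has
# more than this many segments.
min_size <- 2

# Number of segments in each cluster.
tab <- table(clusters)

# Names of the clusters above the threshold, and of the small ones.
large_clusters <- names(tab)[tab > min_size]
small_clusters <- names(tab)[tab <= min_size]

print(large_clusters)
print(small_clusters)
```

Note that `names()` returns the cluster labels as character strings, so they may need converting with `as.integer()` before being compared to numeric cluster ids.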

gabrielparriaux commented 10 months ago

Thanks a lot for your help and sorry to ask a question not directly related to rainette… 😰

Just a question: is there an object “clusters” that I can use to compute the size of each cluster? I can’t find it in the docs.

I had an idea of computing the size of each cluster with something like this:

clusters <- clusters_by_doc_table(dtm_for_analysis, clust_var = "Cluster")
sum(clusters$clust_1)
…

But then I would need a loop to do it for each cluster in the clustering… and I think maybe there is something simpler?

Sorry if it’s an obvious question…

juba commented 10 months ago

If you're looking for the size of each cluster in terms of number of segments, then doing the following should be enough:

clusters <- cutree(res, k = 5)
table(clusters)

Or in your example:

table(dtm_for_analysis$Cluster)
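Putting the two steps together, here is a hedged sketch of how the sizes could then be used to drop the segments of small clusters before a correspondence analysis. The `segments` data frame and `min_size` are hypothetical stand-ins for the real per-segment cluster assignments; only base R is used.

```r
# Hypothetical stand-in for per-segment cluster assignments,
# i.e. the kind of values a "Cluster" variable would hold.
segments <- data.frame(
  segment_id = paste0("seg", 1:10),
  Cluster = c(1, 1, 1, 2, 2, 2, 2, 3, NA, 4)
)

# Hypothetical threshold: keep clusters with more than min_size segments.
min_size <- 2

# Cluster sizes; table() ignores NA values by default.
sizes <- table(segments$Cluster)

# Labels of the clusters that are large enough.
keep <- names(sizes)[sizes > min_size]

# Keep only the segments belonging to those clusters.
# %in% never returns NA, so unassigned segments are dropped too.
filtered <- segments[segments$Cluster %in% keep, ]

print(table(filtered$Cluster))
```

Because the rows are selected by label, the remaining clusters keep their original numbers.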

gabrielparriaux commented 10 months ago

Much easier like this 😬. Thanks a lot for helping, this is exactly what I needed!

fredericln commented 2 weeks ago

Hello, thanks @juba for the clarifications in this thread and other ones.

I just tried rainette for a few hours on 2 corpora (one from social networks, on health problems; the other from an RPS survey in a large company), and it's a pleasure for users, both because it secures the ability to use the Reinert method (Iramuteq updates were quite expected, excuse me Pierre R.!) and because rainette outputs are normal, reusable R objects.

Yet @gabrielparriaux's and others' points make sense to me. On both corpora, around 2/3 of the classes were very tiny. I could, for sure, filter out these outlier classes afterwards, but as far as I understand, a) I cannot use rainette_explor on the filtered list (so 2/3 of the explor screen contents are garbage), and b) it would mean renumbering the classes, so the correspondence with the original numbering would be lost.

Imho, it would be great:

Just my two cents; I am (well, I became, with age) an absolute layman in software development, and other users may feel this is a non-issue.

juba commented 1 week ago

Hi,

Thanks for your observations; I think they are totally legit. Unfortunately I don't work on textual analysis anymore, so rainette is currently in low-maintenance mode. But I'll definitely try to implement your suggestions if I find some time in the future...