juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Select clusters containing more than a certain number of segments #31

Open gabrielparriaux opened 6 months ago

gabrielparriaux commented 6 months ago

Hi @juba,

After running a rainette clustering, I often run a Correspondence Analysis on the lexicon and clusters.

In that case, very small clusters tend to pull the plot to the extremes, making it difficult to read.

So I’m looking for a way to select and isolate the clusters that contain a very small number of segments.

In other words, I need to build a vector with the names of the clusters that contain fewer than a certain number of segments.

I have looked at the available documentation but have no idea how to do it.

Can you help me and point me in the right direction?

Thanks a lot for your help!

Gabriel

juba commented 6 months ago

This is not directly related to rainette. You have to compute the size of the clusters and filter out the smaller ones. Something like:

tab <- table(clusters)
names(tab)[tab > min_size]
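A fuller, runnable version of that idea might look like the sketch below (the cluster labels and `min_size` threshold here are made up for illustration; in practice `clusters` would come from the rainette result):

```r
# Toy cluster labels for 10 segments (made-up data)
clusters <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)
min_size <- 3

# Number of segments per cluster
tab <- table(clusters)

# Names of clusters with more than min_size segments (keep for the CA)
big <- names(tab)[tab > min_size]    # "1"

# Names of clusters with fewer than min_size segments
# (the small ones pulling the plot to the extremes)
small <- names(tab)[tab < min_size]  # c("3", "4")
```

Either vector can then be used to subset the data before running the Correspondence Analysis.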

gabrielparriaux commented 6 months ago

Thanks a lot for your help and sorry to ask a question not directly related to rainette… 😰

Just a question: is there an object “clusters” that I can use to compute the size of each cluster? I can’t find it in the docs.

I had an idea of computing the size of each cluster with something like this:

clusters <- clusters_by_doc_table(dtm_for_analysis, clust_var = "Cluster")
sum(clusters$clust_1)
…

But then I'd need a loop to do this for each cluster in the clustering… and I think maybe there is something simpler?
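If `clusters_by_doc_table()` returns one `clust_*` count column per cluster (which the `clusters$clust_1` access above suggests), `colSums()` would avoid the loop. A sketch with a made-up data frame standing in for that output:

```r
# Stand-in for a clusters_by_doc_table() result: one row per document,
# one clust_* count column per cluster (made-up values)
clusters <- data.frame(
  doc_id  = c("doc1", "doc2", "doc3"),
  clust_1 = c(5, 3, 0),
  clust_2 = c(1, 0, 2),
  clust_3 = c(0, 1, 0)
)

# Sum every clust_* column in one call instead of looping
sizes <- colSums(clusters[grep("^clust_", names(clusters))])
sizes
# clust_1 clust_2 clust_3
#       8       3       1
```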

Sorry if it’s an obvious question…

juba commented 6 months ago

If you're looking for the size of each cluster in terms of number of segments, then doing the following should be enough:

clusters <- cutree(res, k = 5)
table(clusters)

Or in your example:

table(dtm_for_analysis$Cluster)
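To then get back to the original goal, a vector of small-cluster names that can be used to subset the dtm before the CA, something like the following sketch could work. It assumes `dtm_for_analysis` is a quanteda dfm with `Cluster` stored as a docvar (which the `$Cluster` access suggests); the toy documents and threshold are made up:

```r
library(quanteda)

# Toy dfm standing in for dtm_for_analysis (made-up segments)
txt <- c("a b", "a c", "b c", "a b c", "c d")
dtm_for_analysis <- dfm(tokens(txt))
docvars(dtm_for_analysis, "Cluster") <- c(1, 1, 2, 2, 3)

# Segments per cluster
tab <- table(dtm_for_analysis$Cluster)

# Names of clusters with fewer than min_size segments
min_size <- 2
small <- names(tab)[tab < min_size]  # cluster "3" here

# Drop the segments belonging to the small clusters before the CA
dtm_filtered <- dfm_subset(dtm_for_analysis, !Cluster %in% small)
```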

gabrielparriaux commented 6 months ago

Much easier like this 😬. Thanks a lot for helping, this is exactly what I needed!