juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Cluster segments by document of origin #28

Closed gabrielparriaux closed 1 year ago

gabrielparriaux commented 1 year ago

Hello @juba,

For one specific cluster, I would like to see how its segments are distributed across the documents they come from.

For example, if I check for cluster 4, I would like a table saying that:

126 segments come from document 1
24 segments come from document 2
0 segments come from document 3
…

(or the same with percents)

I tried docs_by_cluster_table(), which I thought would do exactly this, but then I understood that it does not give what I want: it gives the number of documents that have segments belonging to the cluster.

Is there a simple way to get a table giving, for one cluster, the number of segments (or the percentage of them) that come from each original document in the corpus?

Thanks a lot for your help!

Gabriel

juba commented 1 year ago

Hi,

Maybe with the clusters_by_doc_table() function?

gabrielparriaux commented 1 year ago

Yes, but clusters_by_doc_table() gives me the distribution of the segments of one document across all clusters. For example, for document 1: 15% of the segments belong to cluster 1, 5% belong to cluster 2, 46% belong to cluster 3…

In a way, I would like the reverse of this: the distribution of the segments of one cluster across all documents. Something like: for cluster 1, 8% of the segments come from document 1, 34% of the segments come from document 2, and so on.

Or is there a logical way to go from the one to the other? Am I missing something maybe?

Thanks a lot for helping,

Gabriel

juba commented 1 year ago

I think that if you compute column percentages on the table produced by clusters_by_doc_table(), you should get the result you want.

Something like:

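library(dplyr)

# Segment counts: one row per document, one column per cluster
tab <- clusters_by_doc_table(corpus, "group")

# Divide each cluster column by its total to get column percentages
tab |> mutate(
  across(
    where(is.numeric),
    ~ .x / sum(.x) * 100
  )
)

gabrielparriaux commented 1 year ago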

OK fantastic, this is what I need!

I now get a table with 140 rows (the documents of my corpus) × 34 columns (the clusters).

Now I would like to look at how the clusters are distributed according to another variable instead of the document (for example, the country of origin). I know how to add a column to my table holding the value of the "country of origin" variable for each document.

But then, is there an easy way in R to regroup the rows and compute the distribution of segments in the clusters by country of origin rather than by document? The idea would be to get a table with 3 rows (the three countries of origin in my corpus) and 34 columns (the clusters), each cell counting the number of segments in this cluster that come from this country.

I’m sorry, I struggle to program correctly in R. I have seen this kind of quick tidyverse way of chaining commands to get a result many times, but I don’t feel comfortable writing it correctly myself (I really need to take a course!)

Thanks a lot if you can help me again, that’s very appreciated!

Gabriel

gabrielparriaux commented 1 year ago

I was able to do it with the aggregate() function, so I’m OK! Sorry for bothering you…
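
For readers with the same question, here is a minimal sketch of what such an aggregate() call could look like. The exact call used above is not shown in the thread, so everything here is an assumption: the toy table stands in for the output of clusters_by_doc_table(), and the column names (doc_id, clust_1, clust_2, country) and country codes are hypothetical.

```r
# Toy stand-in for the per-document table: one row per document,
# one count column per cluster (all names here are hypothetical).
tab <- data.frame(
  doc_id  = c("doc1", "doc2", "doc3", "doc4"),
  clust_1 = c(12, 3, 0, 5),
  clust_2 = c(4, 10, 7, 1),
  stringsAsFactors = FALSE
)

# Hypothetical document-level variable, in the same row order as `tab`.
tab$country <- c("CH", "FR", "CH", "BE")

# Drop the document identifier, then sum every remaining cluster
# column within each country. `.` means "all columns except country".
counts <- tab[, setdiff(names(tab), "doc_id")]
by_country <- aggregate(. ~ country, data = counts, FUN = sum)

# Optional: turn the counts into column percentages, so that each
# cluster column sums to 100 across countries.
pct <- by_country
pct[-1] <- lapply(by_country[-1], function(x) x / sum(x) * 100)

print(by_country)
print(pct)
```

The formula interface keeps the call short even with 34 cluster columns, since the columns do not have to be listed one by one. Note that it drops rows containing missing values by default.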

juba commented 1 year ago

Ah, nice! Glad you found a way to do it!