Number of documents containing a specific token per cluster

juba / rainette

R implementation of the Reinert text clustering method

https://juba.github.io/rainette/

Number of documents containing a specific token per cluster #16

Closed by gabrielparriaux 2 years ago

gabrielparriaux commented 2 years ago

Hello,

Iramuteq can run a correspondence analysis (CA) after a Reinert classification.

I’m trying to reproduce it "manually" in R, using Rainette for the Reinert classification part.

As Pierre Ratinaud has mentioned, Iramuteq doesn’t build the contingency table from a simple count of token occurrences per cluster; instead, for each cluster, it counts the number of documents that contain the token.

Looking at the rainette_stats function, I don’t see how to get the number of documents in which each token appears (for each cluster).

Do you see an easy way to get these statistics?

Thanks a lot for your precious help,

Gabriel

juba commented 2 years ago

Hi,

If your document-feature matrix is in an object called dtm, I think the following should compute what you need.

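library(quanteda)
library(rainette)

# Cluster the documents, then get each document's cluster at k = 5
res <- rainette(dtm, k = 5)
groups <- cutree(res, k = 5)

# Document counts per cluster: 0/1 presence, summed within each cluster
dtm |>
  dfm_weight("boolean") |>
  dfm_group(groups)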

dfm_weight("boolean") transforms the dtm values into 0/1 (indicating whether a term is present in a document or not), and dfm_group(groups) sums this transformed dfm by cluster, so each cell holds the number of documents in that cluster that contain the term.
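To make the difference between the two kinds of counts concrete, here is a small sketch on a made-up corpus. The names toy_dtm and toy_groups and the four example texts are hypothetical; toy_groups simply stands in for the result of cutree(). The force = TRUE argument is an assumption on my side: depending on the quanteda version, dfm_group() may refuse to group a weighted dfm without it.

```r
library(quanteda)

# Four tiny documents, invented purely for illustration
txt <- c(d1 = "cat cat dog",
         d2 = "cat bird",
         d3 = "dog dog dog",
         d4 = "bird cat")
toy_dtm <- dfm(tokens(txt))

# Hypothetical cluster assignment standing in for cutree(res, k = ...)
toy_groups <- c(1, 1, 2, 2)

# Raw token occurrences per cluster
token_counts <- dfm_group(toy_dtm, toy_groups)

# Number of documents containing each token, per cluster.
# force = TRUE is an assumption: it allows grouping a weighted dfm
# in quanteda versions that would otherwise stop with an error.
doc_counts <- toy_dtm |>
  dfm_weight("boolean") |>
  dfm_group(toy_groups, force = TRUE)

print(token_counts)
print(doc_counts)
```

In cluster 1, "cat" occurs three times but in only two documents, and in cluster 2, "dog" occurs three times but in a single document, so the two tables differ exactly where a token is repeated within a document.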

Let me know if you think this is not what you are looking for.

gabrielparriaux commented 2 years ago

Great, this is exactly what I needed! Thanks a lot @juba!