juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

rainette_stats() and criteria of selection of most strongly associated tokens #23

Closed by gabrielparriaux 1 year ago

gabrielparriaux commented 1 year ago

Dear Julien,

I’m using rainette_stats() to get the statistics of the tokens that are associated with each cluster.

In the list of results, for each cluster, I would like to keep only a subset of the tokens, in order to limit how many are displayed in a correspondence analysis.

What would be the right criterion for deciding where to cut off my list?

Is it better to select the first n tokens for each cluster, or to select the tokens with a chi-square value above a certain threshold (the same threshold for every cluster)?

Thanks a lot if you have advice about that (and sorry for the quite non-technical question… I have no idea where else to ask for this…)

Gabriel

juba commented 1 year ago

I don't think there's really one "good answer" to your question. You could filter the tokens based on your correspondence analysis results, take the n most specific, or set a keyness value threshold.

In these cases I tend to be pragmatic: try the different possibilities and see which gives the most interesting and readable results...
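To make two of these options concrete, here is a minimal base R sketch comparing a "top n per cluster" rule with a shared chi-square threshold. It does not call rainette itself: `stats` is mock data standing in for the result of `rainette_stats()`, which is assumed here to be a list with one data frame per cluster containing `feature` and `chi2` columns. The helper names `top_n_tokens()` and `above_threshold()` are hypothetical and not part of the package.

```r
# Mock data standing in for the output of rainette_stats().
# Assumption: a list with one data frame per cluster, each with
# a `feature` column (token) and a `chi2` column (keyness value).
stats <- list(
  data.frame(feature = c("tax", "budget", "vote", "law"),
             chi2 = c(120.5, 80.2, 15.3, 4.1),
             stringsAsFactors = FALSE),
  data.frame(feature = c("school", "teacher", "pupil"),
             chi2 = c(60.0, 9.8, 3.2),
             stringsAsFactors = FALSE)
)

# Option 1 (hypothetical helper): keep the n tokens with the
# highest chi-square value in each cluster.
top_n_tokens <- function(stats, n) {
  lapply(stats, function(df) {
    df <- df[order(df$chi2, decreasing = TRUE), ]
    head(df, n)
  })
}

# Option 2 (hypothetical helper): keep the tokens whose chi-square
# value exceeds a single threshold shared by all clusters.
above_threshold <- function(stats, threshold) {
  lapply(stats, function(df) df[df$chi2 > threshold, ])
}

by_rank <- top_n_tokens(stats, n = 2)
by_value <- above_threshold(stats, threshold = 10)

# Number of tokens kept per cluster under each rule
sapply(by_rank, nrow)
sapply(by_value, nrow)
```

The first rule always keeps the same number of tokens per cluster, while the second lets that number vary with how strongly each cluster's vocabulary stands out. Which one reads better on the correspondence analysis plot is something to check by trying both.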

gabrielparriaux commented 1 year ago

Thanks a lot for the pragmatic answer. I will try the different possibilities and choose whichever is most understandable!