juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Multiple Correspondence Analysis with words and rainette data #25

Closed by gabrielparriaux 1 year ago

gabrielparriaux commented 1 year ago

After running a Reinert clustering with rainette, I would like to perform a Multiple Correspondence Analysis (MCA) that crosses the words of the lexicon (or a subset of them) with other categorical variables (the clusters, but also other docvars that I have for my documents).

I have seen your answer to another question, where you show how to perform an MCA that crosses clusters with categorical variables; it’s nice and it works perfectly.

But how could I do such an analysis with the whole lexicon instead of just the clusters?

As I understand it, we have to create a table that joins a frequency table (with the words) and a table of categorical variables, as in this example, with the documents (or segments) as rows.

[Screenshot, 2023-01-30 at 16:56:22]
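
To make the table structure concrete, here is a minimal sketch in base R. The words, variable names and values are toy data invented for illustration; they do not come from a real corpus.

```r
# Toy word counts: 4 documents (rows) x 3 words (columns); invented values
freq <- data.frame(
  school  = c(2, 0, 1, 0),
  teacher = c(1, 1, 0, 0),
  code    = c(0, 3, 0, 2),
  row.names = paste0("doc", 1:4)
)

# Toy categorical variables for the same documents (hypothetical names)
vars <- data.frame(
  cluster = factor(c("1", "2", "1", "2")),
  gender  = factor(c("F", "M", "M", "F")),
  row.names = paste0("doc", 1:4)
)

# Both tables must describe the same documents, in the same order
stopifnot(identical(rownames(freq), rownames(vars)))

# Joined table: documents as rows, words and categorical variables as columns
joined <- cbind(freq, vars)

# Optional aggregation: word counts summed within each category of one variable
counts_by_cluster <- rowsum(freq, group = vars$cluster)
```

Here `joined` keeps one row per document, while `counts_by_cluster` has one row per cluster and one column per word.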

In the literature, I have seen two kinds of solutions:

- Cibois, who proposes creating either a Burt table or a "lexical table of the questions" (tableau lexical des questions in French)

[Screenshot, 2023-01-30 at 16:47:02]

Cibois, P. (1990). Éclairer le vocabulaire des questions ouvertes par les questions fermées : Le tableau lexical des questions. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 26(1), 12‑21.

- Bécue Bertaut and colleagues, who designed a dedicated method (CA-GALT) for such cases (the first image above is taken from their article)

Kostov, B. A., Bécue Bertaut, M. M., & Husson, F. (2015). Correspondence analysis on generalised aggregated lexical tables (CA-GALT) in the FactoMineR package. R Journal, 7(1), 109‑117.

I’m not sure which of these approaches to take.

Do you have any idea about that?

Sorry if this question is off topic for rainette… (it’s a kind of continuation)

Thanks a lot for helping!

juba commented 1 year ago

I'm not really familiar with these methods, but if I understand the first figure correctly, your Y table would be the document-feature matrix (documents as rows, terms as columns), and your X table would be the docvars() of your corpus (documents as rows, metadata variables as columns)?

So in theory you could use FactoMineR::CaGalt on these two tables?
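
A minimal sketch of what that could look like, assuming the quanteda and FactoMineR packages are installed. It uses quanteda's bundled `data_corpus_inaugural` and its `Party` docvar as stand-ins for your own corpus and variables, the trimming threshold is arbitrary, and it has not been checked against your data:

```r
library(quanteda)
library(FactoMineR)

# Example corpus shipped with quanteda; replace with your own corpus
corp <- data_corpus_inaugural

# Y: document-feature matrix, restricted to reasonably frequent terms
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm_trim(dfm(toks), min_termfreq = 100)  # arbitrary threshold

# convert() adds a doc_id column as first column; drop it to keep counts only
Y <- convert(dfmat, to = "data.frame")[, -1]

# X: categorical docvars for the same documents, in the same order
# (a cluster membership column could be added here as another factor)
X <- data.frame(Party = droplevels(as.factor(docvars(dfmat, "Party"))))

stopifnot(nrow(Y) == nrow(X))

# type = "n" tells CaGalt that the contextual variables are categorical
res <- CaGalt(Y = Y, X = X, type = "n", graph = FALSE)
```

See `?CaGalt` for the structure of the returned object and the available plots.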

gabrielparriaux commented 1 year ago

Yes, this is correct: the Y table would be the document-feature matrix and the X table would be the docvars(). And yes again, in theory I think I could use FactoMineR::CaGalt on those two.

I wanted to know whether you thought this was the proper way to do it or whether there was a better solution. I will try it that way! Thank you for helping!