Closed: gabrielparriaux closed this issue 1 year ago
I'm not an expert in this field, but here are some takes based on my relatively short experience as a practitioner of these methods:
On a side note, I think that modern dimension reduction algorithms such as t-SNE or UMAP, applied to a document-term matrix, could also give interesting results.
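As a concrete illustration of that side note, here is a minimal Python sketch (the tiny corpus and all parameters are made up; scikit-learn is assumed to be available) that builds a document-term matrix and projects the documents into two dimensions with t-SNE:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Toy corpus standing in for lesson-transcript segments (made-up data).
docs = [
    "the teacher explains the lesson",
    "students ask questions about the lesson",
    "the teacher answers student questions",
    "group work on a shared exercise",
    "students present their exercise results",
    "the teacher summarizes the discussion",
]

# Document-term matrix: one row per document, one column per word form.
dtm = CountVectorizer().fit_transform(docs).toarray()

# t-SNE embedding; perplexity must be smaller than the number of documents.
embedding = TSNE(n_components=2, perplexity=2, init="random",
                 random_state=0).fit_transform(dtm)
print(embedding.shape)  # one 2-D point per document
```

The resulting 2-D coordinates can then be scatter-plotted to explore whether documents group visually; the same matrix could equally be fed to UMAP.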
Sorry not to have real expert knowledge to share or definitive answers to give; I hope it is helpful anyway.
Hello @juba,
Thanks a lot for your very informed advice on the topic!
I had never heard of t-SNE or UMAP. If I understand correctly, they are alternatives to Correspondence Analysis for reducing dimensions. I saw some very nice visualizations online using those algorithms; they seem interesting!
Again, thanks a lot for your expertise and for your time to answer these questions!
Best,
Gabriel
Hello,
I am conducting a research project in which I wish to investigate the discourse of teachers in the classroom. My corpus consists of transcriptions of recordings of about twenty lessons, possibly more, given by different teachers.
Having discovered Reinert's hierarchical top-down clustering, first in Iramuteq and then in R with the rainette package, I plan to use this clustering method to explore teachers' discourse in class.
Not being a statistician myself, I think I understand the basics of the algorithm, but I am trying to be more rigorous about some aspects.
My questions concern the following topics:
- Quality of each bipartition: in the Reinert method, if I understand correctly, each bipartition is based on a chi-square distance, and one seeks to maximize this value to obtain the two best possible clusters at each split. Is it enough to rely on this distance alone, or should one also "qualify" or "validate" the quality of the assignment of the different segments within a cluster? I am thinking, for example, of the silhouette method, which apparently allows this (https://en.wikipedia.org/wiki/Silhouette_(clustering)).
- Choice of the number of clusters: in the articles I have read so far on the Reinert method, the final number of clusters seems to be chosen by the researchers in a relatively approximate way... I therefore wonder about the validity of my own choices regarding the number of clusters. Is there a statistical method for determining the best number of clusters to keep? The silhouette method mentioned above apparently makes it possible to assess a clustering, and in particular to determine whether the right number of clusters was chosen. Would it be possible, or even appropriate, to use it in the context of a Reinert clustering?
- Reinert versus other clustering methods: the Reinert method is relatively old (1983), and there are more modern clustering methods, developed notably in the context of machine learning. In a relatively recent article comparing and evaluating clustering methods*, the Reinert method itself is not considered, but is simply mentioned as being based on one of the oldest divisive methods (Williams and Lambert, 1959). In the context of textual analysis, how can one defend the use of this clustering method over other methods that are more recent, or even more efficient? Do you have pointers to the literature that would support its use?
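To make the silhouette idea from the first question concrete: here is a minimal Python sketch (a generic scikit-learn illustration on a made-up toy matrix, not rainette's API or the Reinert algorithm) of how per-segment silhouette values can qualify an existing bipartition:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy document-term matrix (rows = segments) and a hypothetical bipartition.
dtm = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [3, 0, 0, 1],
    [0, 2, 3, 1],
    [0, 3, 2, 0],
    [1, 2, 3, 0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

# Silhouette per segment: near 1 = well placed, negative = likely misassigned.
per_segment = silhouette_samples(dtm, labels)
print(per_segment.round(2))
print("mean silhouette:", round(float(silhouette_score(dtm, labels)), 2))
```

Note that scikit-learn's default metric here is Euclidean; using a chi-square distance, as the Reinert method does, would require passing a precomputed distance matrix.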
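For the number-of-clusters question, one common heuristic is to sweep over candidate values of k and keep the one with the best average silhouette. A sketch of that heuristic, using scikit-learn's generic agglomerative clustering on simulated data (again, an illustration, not the Reinert algorithm itself):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Simulated document-term matrix with three latent topics (made-up data):
# each block of 8 "segments" over-uses a different pair of term columns.
rng = np.random.default_rng(0)
dtm = np.vstack([
    rng.poisson(3, size=(8, 6)) + np.array([8, 8, 0, 0, 0, 0]),
    rng.poisson(3, size=(8, 6)) + np.array([0, 0, 8, 8, 0, 0]),
    rng.poisson(3, size=(8, 6)) + np.array([0, 0, 0, 0, 8, 8]),
])

# Average silhouette for each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(dtm)
    scores[k] = float(silhouette_score(dtm, labels))

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

The same sweep could in principle be applied to the successive cut heights of a Reinert dendrogram, though whether the Euclidean silhouette is appropriate for chi-square-based clusters is exactly the open question above.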
I would be glad to receive any answers or references to the literature that could help me with these questions!
Thanks in advance for your help and best regards,
Gabriel Parriaux