Closed: gabrielparriaux closed this issue 1 year ago
I'm not an expert in this field, but here are some takes based on my relatively short experience as a practitioner of these methods:
On a side note, I think that modern dimension reduction algorithms such as t-SNE or UMAP, applied to a document-term matrix, could also give interesting results.
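As a concrete illustration of that side note, here is a minimal Python sketch (the tiny corpus and all parameters are made up; scikit-learn is assumed to be available) that builds a document-term matrix and projects the documents into two dimensions with t-SNE:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Toy corpus standing in for lesson-transcript segments (made-up data).
docs = [
    "the teacher explains the lesson",
    "students ask questions about the lesson",
    "the teacher answers student questions",
    "group work on a shared exercise",
    "students present their exercise results",
    "the teacher summarizes the discussion",
]

# Document-term matrix: one row per document, one column per word form.
dtm = CountVectorizer().fit_transform(docs).toarray()

# t-SNE embedding; perplexity must be smaller than the number of documents.
embedding = TSNE(n_components=2, perplexity=2, init="random",
                 random_state=0).fit_transform(dtm)
print(embedding.shape)  # one 2-D point per document
```

The resulting 2-D coordinates can then be scatter-plotted to explore whether documents group visually; the same matrix could equally be fed to UMAP.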
Sorry not to have real expert knowledge to share or definitive answers to give; I hope it is helpful anyway.
Hello @juba,
Thanks a lot for your very informed advice on the topic!
I had never heard of t-SNE or UMAP. If I understand correctly, they are alternatives to Correspondence Analysis for reducing dimensions. I saw some very nice visualizations online using those algorithms; they seem interesting!
Again, thanks a lot for your expertise and for your time to answer these questions!
Best,
Gabriel
Hello,
I am conducting a research project in which I wish to investigate the discourse of teachers in the classroom. My corpus consists of transcriptions of recordings of about twenty lessons, possibly more, given by different teachers.
Having discovered Reinert's hierarchical top-down clustering, first in Iramuteq and then in R with the rainette package, I plan to use this clustering method to explore teachers' discourse in class.
Not being a statistician myself, I think I understand the basics of the algorithm, but I am trying to be more rigorous about some aspects.
My questions concern the following topics:
- Quality of each bipartition: in the Reinert method, if I understand correctly, each bipartition is based on a chi-square distance, and one seeks to maximize this value to obtain the two best possible clusters at each split. Is it enough to rely on this distance alone, or should one also "qualify" or "validate" the quality of the assignment of the different segments within a cluster? I am thinking, for example, of the silhouette method, which apparently allows this (https://en.wikipedia.org/wiki/Silhouette_(clustering)).
- Choice of the number of clusters: in the articles I have read so far on the Reinert method, the final number of clusters seems to be chosen by the researchers in a relatively approximate way... I therefore wonder about the validity of my own choices regarding the number of clusters. Is there a statistical method for determining the best number of clusters to keep? The silhouette method mentioned above apparently makes it possible to assess a clustering, and in particular to determine whether the right number of clusters was chosen. Would it be possible, or even appropriate, to use it in the context of a Reinert clustering?
- Reinert versus other clustering methods: the Reinert method is relatively old (1983), and there are more modern clustering methods, developed notably in the context of machine learning. In a relatively recent article comparing and evaluating clustering methods*, the Reinert method itself is not considered, but is simply mentioned as being based on one of the oldest divisive methods (Williams and Lambert, 1959). In the context of textual analysis, how can one defend the use of this clustering method over other methods that are more recent, or even more efficient? Do you have pointers to the literature that would support its use?
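To make the silhouette idea from the first question concrete: here is a minimal Python sketch (a generic scikit-learn illustration on a made-up toy matrix, not rainette's API or the Reinert algorithm) of how per-segment silhouette values can qualify an existing bipartition:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy document-term matrix (rows = segments) and a hypothetical bipartition.
dtm = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [3, 0, 0, 1],
    [0, 2, 3, 1],
    [0, 3, 2, 0],
    [1, 2, 3, 0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

# Silhouette per segment: near 1 = well placed, negative = likely misassigned.
per_segment = silhouette_samples(dtm, labels)
print(per_segment.round(2))
print("mean silhouette:", round(float(silhouette_score(dtm, labels)), 2))
```

Note that scikit-learn's default metric here is Euclidean; using a chi-square distance, as the Reinert method does, would require passing a precomputed distance matrix.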
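For the number-of-clusters question, one common heuristic is to sweep over candidate values of k and keep the one with the best average silhouette. A sketch of that heuristic, using scikit-learn's generic agglomerative clustering on simulated data (again, an illustration, not the Reinert algorithm itself):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Simulated document-term matrix with three latent topics (made-up data):
# each block of 8 "segments" over-uses a different pair of term columns.
rng = np.random.default_rng(0)
dtm = np.vstack([
    rng.poisson(3, size=(8, 6)) + np.array([8, 8, 0, 0, 0, 0]),
    rng.poisson(3, size=(8, 6)) + np.array([0, 0, 8, 8, 0, 0]),
    rng.poisson(3, size=(8, 6)) + np.array([0, 0, 0, 0, 8, 8]),
])

# Average silhouette for each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(dtm)
    scores[k] = float(silhouette_score(dtm, labels))

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

The same sweep could in principle be applied to the successive cut heights of a Reinert dendrogram, though whether the Euclidean silhouette is appropriate for chi-square-based clusters is exactly the open question above.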
I would be glad to receive any answers or references to the literature that could help me with these questions!
Thanks in advance for your help and best regards,
Gabriel Parriaux