juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Questions with double classification parameters #19

Closed by gabrielparriaux 2 years ago

gabrielparriaux commented 2 years ago

Hello,

I’m performing a double classification and have several questions concerning the parameters.

  1. I understand that no dendrogram is shown in the plot for a double classification. Still, it is possible to move the "Number of clusters" cursor in the left pane to show two, then three, then four clusters, and so on up to the maximum number of clusters we created, which gives the impression that clusters appear in a certain order. Does this order of appearance make any sense in the analysis, or should we ignore it? It seems to be relevant in a simple analysis, but I’m not sure it is also relevant in a double one.
  2. I thought that running an analysis with k = 8 and then moving the "Number of clusters" cursor down to 7 would give the same result as running an analysis with k = 7. But I realized that this is not the case: the results are different! I’m not sure I understand why. Would it be possible (if not too difficult) to explain?
  3. What are the "usual" values we should try for min_segment_size in the first two simple classifications? I want to test different settings to choose the best one, but it’s not easy to know whether I have tested a good sample of settings or missed something. So far, I have tested the following pairs:
res1 <- rainette(dtm, k = 10, min_segment_size = 8)
res2 <- rainette(dtm, k = 10, min_segment_size = 10)

res1 <- rainette(dtm, k = 10, min_segment_size = 8)
res2 <- rainette(dtm, k = 10, min_segment_size = 12)

res1 <- rainette(dtm, k = 10, min_segment_size = 8)
res2 <- rainette(dtm, k = 10, min_segment_size = 15)

res1 <- rainette(dtm, k = 10, min_segment_size = 10)
res2 <- rainette(dtm, k = 10, min_segment_size = 12)

res1 <- rainette(dtm, k = 10, min_segment_size = 10)
res2 <- rainette(dtm, k = 10, min_segment_size = 15)

res1 <- rainette(dtm, k = 10, min_segment_size = 12)
res2 <- rainette(dtm, k = 10, min_segment_size = 15)

Thank you for your help,

Gabriel

juba commented 2 years ago
  1. No, there is no cluster order and no cluster hierarchy when doing a double classification. That's why there is no dendrogram.

  2. When doing a double classification with k=7, you first compute 7 clusters twice, with different min_segment_size values, and then cross them to compute the best partitions into k=1, 2, 3, 4... cross-clusters. When you do a double classification with k=8, you do the same thing, but this time with 8 clusters computed twice. The clusters being crossed in the following steps are therefore different, so it is normal that the results differ. I'm not sure I'm being very clear, but at the end of the following video I try to explain the double classification, which may be useful: https://www.youtube.com/watch?v=T9r8T5WZYHY

  3. Unfortunately, I don't know of any rule for choosing suitable min_segment_size values; I think it really depends on your corpus. I usually try sizes of 10, 15, 20...
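
For what it's worth, points 2 and 3 can be sketched in a few lines of R. This is only an illustration under assumptions, not an official recipe: `run_double()` and `compare_k()` are hypothetical helper names, the sizes are just the ones mentioned above, and the `rainette2()` and `cutree_rainette2()` calls reflect my understanding of the package interface, so please check them against the documentation. The helpers also assume that `dtm` is a dfm built from a corpus split with `split_segments()`.

```r
# Candidate sizes from the reply above (10, 15, 20); adjust to your corpus.
sizes <- c(10, 15, 20)

# Every unordered pair of distinct sizes, one pair per row.
pairs <- t(combn(sizes, 2))
colnames(pairs) <- c("size1", "size2")

# Hypothetical helper: one double classification for a given pair and k.
# Assumes the rainette package is installed.
run_double <- function(dtm, size1, size2, k = 10) {
  res1 <- rainette::rainette(dtm, k = k, min_segment_size = size1)
  res2 <- rainette::rainette(dtm, k = k, min_segment_size = size2)
  # Cross the two simple classifications.
  rainette::rainette2(res1, res2, max_k = k)
}

# Hypothetical helper: cross-tabulate the 7-cluster partition obtained from
# a k = 7 run against the 7-cluster partition obtained from a k = 8 run.
# The two runs cross different starting clusters, so the table is not
# expected to be a simple one-to-one relabelling.
compare_k <- function(dtm, size1 = 10, size2 = 15) {
  groups7 <- rainette::cutree_rainette2(run_double(dtm, size1, size2, k = 7), k = 7)
  groups8 <- rainette::cutree_rainette2(run_double(dtm, size1, size2, k = 8), k = 7)
  table(from_k7 = groups7, from_k8 = groups8, useNA = "ifany")
}
```

With a `dtm` at hand, something like `lapply(seq_len(nrow(pairs)), function(i) run_double(dtm, pairs[i, "size1"], pairs[i, "size2"]))` would run one double classification per pair, and `compare_k(dtm)` would show how the two 7-cluster partitions overlap.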

gabrielparriaux commented 2 years ago

Ok, thank you for your answers! I will look again at the video, but your explanation is very straightforward.