juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Recommendation for clustering of dialogues #32

Closed by gabrielparriaux 6 months ago

gabrielparriaux commented 6 months ago

Hello @juba,

I’m working with a corpus of dialogues composed of a lot of short sentences by different actors. Some sentences contain only 2 or 3 words.

When I use the split_segments() function with segment_size = 40, I obtain only a few more documents than before splitting. For example, I have 635 documents before splitting and 668 documents (segments) after.

Do you have any special recommendations about performing clustering with this kind of corpus?

Should I be aware of something special, or use special settings in the clustering to have proper results?

Thanks a lot for your advice!

Gabriel

juba commented 6 months ago

I don't have any real experience with this kind of data, so unfortunately I won't have any solid advice. Splitting into segments is only useful for longer texts, in order to try to get "homogeneous" units of meaning. So if you work on short dialogues, this may not be necessary.

For really short dialogues, you can try filtering out those under a certain number of words if they don't produce any meaningful results in your clustering.
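As a rough illustration of that filtering idea, here is a minimal base R sketch. The `utterances` vector and the `min_words` threshold are made up for the example, and counting whitespace-separated words is only one possible definition of length:

```r
# Hypothetical utterances standing in for a dialogue corpus
utterances <- c(
  "Yes.",
  "I agree.",
  "We should talk about the schedule before the meeting starts",
  "No idea",
  "Could you explain what you meant by that last remark?"
)

# Assumed threshold; tune it for your own corpus
min_words <- 3

# Count whitespace-separated words in each utterance
word_counts <- lengths(strsplit(trimws(utterances), "\\s+"))

# Keep only the utterances with at least min_words words
kept <- utterances[word_counts >= min_words]
```

If the texts are already in a quanteda corpus, the same logic could probably be applied by tokenizing with `tokens()`, counting with `ntoken()`, and subsetting with `corpus_subset()` before computing the dfm and running the clustering.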

gabrielparriaux commented 6 months ago

Thanks a lot @juba for your answer about this.

The idea of filtering out the segments under a certain number of words seems very good, because I have a lot of very small clusters composed of just one or two very short segments… I'd be better off getting rid of them! I’ll try it!