What community detection methods do we care about?

jwzimmer-zz commented 3 years ago

I'm looking at the options available in https://networkx.org/documentation/stable/reference/algorithms/community.html...

Kernighan-Lin bipartition
- https://en.wikipedia.org/wiki/Kernighan%E2%80%93Lin_algorithm
- We don't have any reason to think two disjoint subsets of the network would be particularly meaningful... maybe it would be interesting with respect to gender or happiness (sentiment analysis) but not obviously relevant
K-Clique percolation method
- https://en.wikipedia.org/wiki/Clique_percolation_method
- Can be used with directed, undirected, weighted, unweighted edges
- Allows overlap between communities and doesn't require assuming number of communities or sizes in advance, so seems good for us (since we don't know much about the network structure)
Modularity
- Finds clusters by comparing how many edges fall inside clusters to how many you'd expect if edges were placed at random (so lots of intra-cluster edges, few inter-cluster edges)
- I think meaningful clusters based on this could exist, so I think it makes sense for us to use
Tree partitioning
- I really don't expect the network to be very tree-like; I think it probably has tons of loops, so this probably isn't relevant for us
Label propagation
- Semi-supervised machine learning to label nodes...
- This could be interesting after we have tokenized/cleaned the node titles (the index titles and the trope titles) for sentiment analysis. We could then see what categories are generated from manually labelling a few categories of interest, or from some indices that seem related in some way, or we could ask it what tropes should be called and see how different the proposed names are. However, I think this is pretty likely to end up being out of scope for this project.
Fluid communities
- https://arxiv.org/abs/1703.09307
- TBH, can't really tell what this means - I guess, sure, why not use it, if we have time? - but I don't see any clear reason to think it's particularly relevant to us
Partitions via centrality measures
- https://en.wikipedia.org/wiki/Girvan%E2%80%93Newman_algorithm
- I think this makes sense for our network, since it sort of tests where the network can break apart.
- Also, it's a kind of dendrogram method, which Jane Adams recommended we try.
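
To make the dendrogram point concrete, here is a rough sketch of walking the first few Girvan-Newman levels and scoring each one by modularity to pick a cut. It assumes networkx's `girvan_newman` and `modularity` functions, uses the built-in karate club graph as a stand-in for our network, and the choice of 5 levels is arbitrary.

```python
import itertools

import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Stand-in graph (assumption); the real trope/index network would go here.
G = nx.karate_club_graph()

# girvan_newman yields one partition per dendrogram level, coarse to fine.
# Only look at the first 5 levels, since each level is expensive to compute.
levels = itertools.islice(girvan_newman(G), 5)

scored = []
for communities in levels:
    partition = [set(c) for c in communities]
    # Score the level by modularity so the levels can be compared.
    scored.append((modularity(G, partition), partition))

# Keep the level with the highest modularity score.
best_score, best_partition = max(scored, key=lambda pair: pair[0])
print(len(best_partition), round(best_score, 3))
```

Each level requires recomputing edge betweenness, so on a large network it probably only makes sense to look at the first handful of levels.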

jwzimmer-zz commented 3 years ago

TL;DR I think we should use the following community detection strategies:

K-clique percolation
Modularity
Partitions via centrality

That means in networkx our options are:
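
A minimal sketch of what those three strategies might look like in code. The function choices here (`k_clique_communities`, `greedy_modularity_communities`, `girvan_newman`) are my assumption about the closest matches in `networkx.algorithms.community`; the karate club graph and `k=3` are placeholders, not settings we've decided on.

```python
import networkx as nx
from networkx.algorithms.community import (
    girvan_newman,
    greedy_modularity_communities,
    k_clique_communities,
)

# Stand-in graph (assumption); swap in the real network.
G = nx.karate_club_graph()

# K-clique percolation: communities can overlap, and nodes that are not
# in any k-clique are left out entirely. k=3 is an arbitrary choice.
kclique = [set(c) for c in k_clique_communities(G, 3)]

# Modularity: greedy modularity maximization, returns a disjoint partition.
by_modularity = [set(c) for c in greedy_modularity_communities(G)]

# Partition via centrality: the first Girvan-Newman split, obtained by
# repeatedly removing the edge with the highest betweenness.
first_split = [set(c) for c in next(girvan_newman(G))]

print(len(kclique), len(by_modularity), len(first_split))
```

Note that the k-clique result is not a partition, so it can't be compared node-for-node with the other two without deciding what to do about overlapping and unassigned nodes.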

jwzimmer-zz commented 3 years ago

@nguyenhphilip can you look this over and let me know your opinion?

jwzimmer-zz commented 3 years ago

Ok, great, got affirmation for this plan from Phil, so I'll close this case as resolved! 👍

nguyenhphilip commented 3 years ago

Just watched some videos on these methods and agree that these three methods (k-clique, modularity optimization, Girvan-Newman edge betweenness) look good!

It seems like we should just try running them on our masterlist and see what pops out. K-clique seems like it will give us larger but fewer communities, which would be nice for visualization. I suspect these may take a while to run, so I will start trying to implement them this week.

jwzimmer-zz commented 3 years ago

Unfortunately we're both running into prohibitively slow runtimes - how can we make the network smaller?

jwzimmer-zz commented 3 years ago

Maybe we only care about very different indices, since indices that have almost everything in common are almost the same?
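
One way this could work, as a rough sketch: treat each index as the set of tropes it links to, and drop any index whose trope set is nearly identical (by Jaccard similarity) to one we've already kept. The names `jaccard` and `drop_near_duplicates`, the `index_to_tropes` mapping, and the 0.9 threshold are all hypothetical, just for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(index_to_tropes, threshold=0.9):
    """Keep an index only if it is not too similar to any index kept so far."""
    kept = {}
    for name, tropes in index_to_tropes.items():
        if all(jaccard(tropes, other) < threshold for other in kept.values()):
            kept[name] = tropes
    return kept


# Toy data (made up): IndexB has exactly the same tropes as IndexA.
example = {
    "IndexA": {"t1", "t2", "t3", "t4"},
    "IndexB": {"t1", "t2", "t3", "t4"},
    "IndexC": {"t5", "t6"},
}
reduced = drop_near_duplicates(example)
print(sorted(reduced))
```

This compares every index against every kept index, so it is quadratic in the worst case, and which index of a near-duplicate pair survives depends on iteration order.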

jwzimmer-zz / tv-tropes

What community detection methods do we care about? #21