vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0

Add max_comm_size to constrain the community size. #46

Closed. orenbenkiki closed this 3 years ago.

orenbenkiki commented 4 years ago

First, thanks for publishing this package, and for supporting the "Surprise" quality function. I was working along these lines, and having it all ready-made (and published) helped a lot.

I'm re-implementing the Metacell algorithm for analyzing single-cell RNA sequencing data, and scaling it up to deal efficiently with very large data (many millions of cells) by converting it to a divide-and-conquer algorithm. I'm working on this as my PhD project in Amos Tanay's lab at the Weizmann Institute of Science.

This pull request implements a minor but (for us) important tweak to the basic clustering algorithm: restricting the maximal community size (the sum of node sizes). Since metacells aren't "classical" clusters (in particular, they are defined as the minimal-sized community satisfying some conditions), this tweak makes it efficient to generate candidate metacells from a KNN graph as part of the complete (complex) pipeline. I found that using this modified version saved CPU time and gave good results.

I tried to make this as seamless as possible with the existing code. Two points where I had to tweak things: ensuring all vertex partitions allow specifying node sizes, even if they do not directly use them, so they can be passed on to the algorithm itself; and changing the parameter order of create_graph_from_py (since node_sizes became mandatory).
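
For illustration, a minimal sketch of how the new parameter is meant to be used from Python, assuming it is exposed through find_partition as in this pull request (the test graph and the cap of 10 are arbitrary examples):

```python
import igraph as ig
import leidenalg as la

# Zachary's karate club: a small test graph that ships with igraph.
graph = ig.Graph.Famous("Zachary")

# With this PR, max_comm_size caps the total node size of any community;
# node sizes default to 1, so here no community may exceed 10 nodes.
partition = la.find_partition(
    graph,
    la.SurpriseVertexPartition,  # the "Surprise" quality function
    max_comm_size=10,
)

assert all(len(comm) <= 10 for comm in partition)
print(partition.sizes())
```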

orenbenkiki commented 3 years ago

Just fixed a typo in the documentation: "used used" appeared in several places (the dangers of copy-and-paste); I changed it to a single "used".

Happy new (Jewish) year!

vtraag commented 3 years ago

Hi, thanks! You too! And thanks for the PR, by the way. I'm working on a review; I didn't get around to finishing it earlier. I hope to get to it sometime this weekend...

orenbenkiki commented 3 years ago

Fixed all issues (I think) as far as this "weak form" pull request goes. As for the "strong form": if we want to pursue it, I suggest we do so in a dedicated issue rather than here.

Thanks for the review!

vtraag commented 3 years ago

Looks good, thanks for the quick response!

I'll think about your comment on the "strong form"; perhaps the solution I have in mind does not work well either. The idea I had is quite simple, in fact: we do not set the current community as the maximum one, and we set the initial maximum improvement to -infinity, while adding the current community to the set of communities to consider. That way, if the current community is too large, it will not be considered. There will always be the possibility to move the node to an empty community (assuming that option is turned on). What do you think?
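
Roughly, in Python-flavoured pseudocode (the names are made up and this is not the actual implementation, which lives in the C++ optimiser), the node-move selection would look like:

```python
import math

def best_move(v, v_comm, candidates, comm_size, node_size, diff_move, max_comm_size):
    # Sketch only: the current community is just another candidate and the
    # running best starts at -infinity, so an over-sized current community
    # is never kept by default.
    best_comm, best_improv = v_comm, -math.inf  # fall back to staying put
    for comm in set(candidates) | {v_comm}:
        # Staying put does not grow the community; moving adds v's size.
        new_size = comm_size[comm] + (0 if comm == v_comm else node_size[v])
        if max_comm_size and new_size > max_comm_size:
            continue  # too large: not considered, not even the current community
        improv = diff_move(v, comm)  # quality improvement of moving v to comm
        if improv > best_improv:
            best_comm, best_improv = comm, improv
    # An empty community is assumed to be among the candidates (when that
    # option is on), so a node in a too-large community can always escape.
    return best_comm
```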

We also have to deal with too-small clusters in our work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

orenbenkiki commented 3 years ago

Thanks for the quick reply.

> I'll think about your comment on the "strong form"; perhaps the solution I have in mind does not work well either. The idea I had is quite simple, in fact: we do not set the current community as the maximum one, and we set the initial maximum improvement to -infinity, while adding the current community to the set of communities to consider. That way, if the current community is too large, it will not be considered. There will always be the possibility to move the node to an empty community (assuming that option is turned on). What do you think?

So, if I get this right, this will effectively split the first several nodes off the too-large community, forcing them either to join other existing communities or to create new ones. An elegant solution, and easy to implement.

EDIT: I can add this to the code if you want, assuming you are OK with the caveats below.

EDIT 2: OK, I did implement this; it was too simple and elegant to pass up. The following caveats still bother me, though:

There's a nasty edge case when all nodes belong to a single too-large community and we are not allowed to create new communities.

EDIT 2: Turns out this isn't an issue if I initialize max_comm to v_comm.

That aside, I'm a bit worried that the nodes chosen to be split off the too-large community are essentially chosen at random, which may impact the optimality / efficiency of the algorithm.

A more serious potential concern is "very" too-large communities (e.g., several times the maximal size). It might be more efficient to run the overall algorithm (or a simplified version of it) on the too-large community to split it into several smaller ones (see the sketch after these caveats). If that is overkill, perhaps even a simple "split into random subsets of roughly equal size" step would be more efficient?

EDIT 2: I still think the above two points are a concern, but I agree it is more consistent to enforce max_comm_size even if it is done in a sub-optimal way. I'd still like to hear what you think about possible improvements to this.
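
To make the recursive idea concrete, it could look roughly like this (a rough sketch against the Python API; the helper name is made up, and it assumes the max_comm_size parameter from this PR):

```python
import igraph as ig
import leidenalg as la

def split_oversized(graph, membership, max_comm_size):
    # Sketch: re-run the algorithm on the induced subgraph of each
    # too-large community, with the size cap enforced.
    membership = list(membership)
    next_comm = max(membership) + 1
    sizes = {}
    for c in membership:
        sizes[c] = sizes.get(c, 0) + 1
    for comm, size in list(sizes.items()):
        if size <= max_comm_size:
            continue
        nodes = [v for v, c in enumerate(membership) if c == comm]
        sub = graph.induced_subgraph(nodes)
        sub_part = la.find_partition(
            sub,
            la.SurpriseVertexPartition,
            max_comm_size=max_comm_size,  # the parameter added by this PR
        )
        # Relabel the pieces as fresh communities in the full graph.
        for v, c in zip(nodes, sub_part.membership):
            membership[v] = next_comm + c
        next_comm += len(sub_part)
    return membership
```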

> We also have to deal with too-small clusters in our work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

What I do right now is convert each too-small community into a node and run the algorithm on the resulting graph (containing only the too-small-community nodes) to cluster them into larger ones. This seems to be "in the spirit" of the overall algorithm. Typically it solves the problem; sometimes I am left with "stubborn" too-small communities which really resist being clustered with anything else, and which I basically consider to contain "outlier" nodes.
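
In terms of the Python API, that step looks roughly like this (a sketch with made-up names, assuming an unweighted input graph):

```python
import igraph as ig
import leidenalg as la

def recluster_small(graph, membership, min_comm_size):
    # Sketch: contract each too-small community to a single node and run
    # the algorithm again on that contracted graph.
    sizes = {}
    for c in membership:
        sizes[c] = sizes.get(c, 0) + 1
    small = {c for c, s in sizes.items() if s < min_comm_size}

    # Subgraph induced by the nodes of too-small communities only.
    nodes = [v for v, c in enumerate(membership) if c in small]
    sub = graph.induced_subgraph(nodes)
    sub.es["weight"] = [1.0] * sub.ecount()

    # Contract each too-small community into one node; parallel edges are
    # summed into weights, and self-loops are dropped for simplicity.
    remap = {c: i for i, c in enumerate(sorted(small))}
    sub.contract_vertices([remap[membership[v]] for v in nodes])
    sub.simplify(combine_edges={"weight": "sum"})

    # Each contracted node now stands for one too-small community; the
    # result groups them into larger ones, with "stubborn" leftovers
    # treated as outliers.
    return la.find_partition(sub, la.SurpriseVertexPartition, weights="weight")
```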

I would be interested in a better approach - as you say, I discovered it is difficult to enforce this during the algorithm (creating new communities becomes an issue).

vtraag commented 3 years ago

Thanks! I will merge after the CI has finished.

orenbenkiki commented 3 years ago

Thanks!