Filter a VertexClustering object based on cluster sizes

ntamas commented 2 years ago

What is the feature or improvement you would like to see?

It is a common requirement when clustering a large graph to plot only the larger communities but not the smaller ones, as illustrated by this SO question.

Maybe it would be even more flexible if we allowed the user to pass an upper and a lower size limit, or an aribtrary filtering predicate based on the sizes.

I've posted a quick sketch of a possible implementation in the SO question.

Use cases for the feature

See above; the SO question provides a possible use-case.

References

N/A

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

5anperez commented 1 year ago

Hello! I would like to contribute to igraph and noticed that this is labeled as a good first issue, which I am hunting for, since it will be my very first contribution. Is this still open and if so, is everything I need to get starrted in the wiki?

ntamas commented 1 year ago

Yes, the issue is still open and feel free to get started on it. There's no wiki page but the README contains some instructions for getting started with development (after checking out the repo and all its submodules); see here.

I'd suggest that you open a PR with your draft as soon as possible (and mark it as draft) and commit often so we can give you early feedback if needed.

As for this specific task, you can use my SO response as a starting point, but it needs to be made a proper method of the VertexClustering class. I'd probably add a new filter_by_size() method to the VertexClustering class.

szhorvat commented 1 year ago

Wouldn't it make sense to try to design some sort of generalized filtering mechanism instead of a one-trick method that only handles sizes?

What if:

I want to filter by total vertex weight or total edge weight inside of a partition.
I want to select the partition containing a specific vertex.
I want to select partitions that contain vertices with a specific property, stored in some attribute. For example, select partitions that have a vertex whose name starts with V, representing areas in the visual cortex within a brain network.
I want to filter an overlapping clustering represented using VertexCover, not VertexClustering

Generally speaking, the area where most open source projects don't do so well is having a coherent, predictable, well thought out design. If you work with a system like Mathematica often enough, this becomes painfully obvious. I would like igraph to stand out from the crowd by being better in this area.

ntamas commented 1 year ago

Wouldn't it make sense to try to design some sort of generalized filtering mechanism instead of a one-trick method that only handles sizes?

I was considering this, but I don't want to overstretch the scope of this issue. It's always possible to add such a filtering method later and then rewrite filter_by_size() based on this. One advantage of adding a separate method for size (which is the most common filtering that people want to use) is that you don't actually need to construct the subgraph.

That being said, I can imagine adding a filter() method to VertexClustering that takes a Boolean predicate. The predicate would be a Python callable that takes the list of vertices in the cluster and returns True/False, but in that case it would be the job of the predicate to construct the subgraph on-the-fly if needed.

szhorvat commented 1 year ago

Alright, that makes sense. But it's still a useful exercise to think about how a generalization could be formulated and then implemented. Creating the subgraph seems like a heavyweight operation. Would it make sense to add a helper function to the C core which can retrieve the IDs of edges contained within a subgraph induced by some vertices? Then it will become relatively easy to compute things like:

Total number of edges in community
Total edge weight in community
Any statistic (e.g. variance) of edge weights within community

More generally, it might make sense for a filtering method to have access to:

Vertices and their attributes
Edges and their attributes

The first is trivial, as the clusters are generally understood in terms of their vertices and API provides access to these through .as_cover(). There does not seem to be a simple way for the second. Am I missing something? If not, do you agree that it makes sense to add the support function to the C core, or at least discuss these ideas further somewhere else (forum?)

ntamas commented 1 year ago

Would it make sense to add a helper function to the C core which can retrieve the IDs of edges contained within a subgraph induced by some vertices?

Sounds like a good idea.

More generally, it might make sense for a filtering method to have access to:

I think that all that the filtering function needs access to is the graph object itself and the members of the current cluster being considered. The graph might not even be needed as you can always add it using functools.partial, e.g.:

def some_predicate(graph, vertices):
    # do something and return True/False here

clusters.filter(partial(some_predicate, graph))

This would allow .filter() to be implemented in the Clustering base class where we don't already have a graph.

But again, I still maintain that this is outside the scope of the current issue, which only asks for filtering based on cluster sizes, so in order not to derail the current discussion, I'd recommend opening a separate issue for a more generic filtering method.

palash018 commented 1 year ago

Is anyone still working on this issue currently?

igraph / python-igraph

Filter a VertexClustering object based on cluster sizes #507