igraph / rigraph

igraph R package
https://r.igraph.org
544 stars 200 forks source link

cluster_leiden() docs should be explicit about the objective function #529

Open szhorvat opened 2 years ago

szhorvat commented 2 years ago

The cluster_leiden() documentation does not have the formula for the objective function. What is being maximized using CPM and modularity, respectively?

Additionally, since CPM is the default, the docs should provide guidance on how to get good results with CPM, i.e. how to set the resolution. The default value of 1 does not usually lead to good results. As a result, it's almost guaranteed that the results will not be good if using the default settings (a simple cluster_leiden(g) call). In my mind, this makes CPM a bad default choice unless it is accompanied by an appropriate default for resolution as well—but this is for @vtraag to decide. I expect that most users won't even read the documentation, and instead will just try the defaults.

vtraag commented 2 years ago

Yes, fair point about the explicit formula's, probably good to include that.

What might be a reasonable default, but still keeping CPM as a default, and without doing some annoying normalisation or something, would perhaps be the following. Usually, in most cases, resolutions of interest are located somewhere around the graph density, so we could take resolution_parameter = edge_density(graph) as a starting point.

There are two difficulties however with this default. If a user would select objective_function = "modularity" this default would not make much sense, and the default of 1 would make more sense. Secondly, if weights are supplied, the "weighted density" (i.e. sum(weights)/(n(n - 1)/2) with n the number of nodes) should be used as a default instead.

I'm not directly sure how this can best be implemented in R. It seems you can depend on other arguments in the default argument, right? If so, something like

resolution_parameter = switch(objective_function,
                              CPM=if (weights) weighted_density(graph) else density(graph),
                              modularity=1)

might do the trick? (Not sure if this type of inline if works in R)

ntamas commented 2 years ago

It seems you can depend on other arguments in the default argument, right?

I think this would work, but you need to write something like ifelse(is.null(weights), density(graph), weighted_density(graph)). But IMHO weighted_density(graph, weights) could fall back to the unweighted density if weights is NULL or NA (assuming we have such a function at all).

And of course you need to wrap the whole thing in a giant if (is.missing(resolution_parameter)) to let the user bring their own.

vtraag commented 2 years ago

I think this would work, but you need to write something like ifelse(is.null(weights), density(graph), weighted_density(graph)). But IMHO weighted_density(graph, weights) could fall back to the unweighted density if weights is NULL or NA (assuming we have such a function at all).

Right, thanks!

And of course you need to wrap the whole thing in a giant if (is.missing(resolution_parameter)) to let the user bring their own.

I believe I can write this simply as the default within the function declaration. That way, if a user supplies its own value it would automatically always use that, right?

ntamas commented 2 years ago

Yup, but maybe for the sake of readability it's better to put it in the function body. I leave it up to you.

szhorvat commented 2 years ago

Can the default resolution_parameter value not depend on the value of objective_function?

Another comment: would it be better to use the short resolution instead of the very long resolution_parameter? cluster_louvain uses resolution.

vtraag commented 2 years ago

Can the default resolution_parameter value not depend on the value of objective_function?

I'm not sure if I understand this. I propose that it does depend on the value of the objective_function. By design it has to, the same resolution default do not make sense for both objective functions. What would you propose exactly?

Another comment: would it be better to use the short resolution instead of the very long resolution_parameter? cluster_louvain uses resolution.

Yes, that's no problem of course.

szhorvat commented 2 years ago

What is the "weighted density"?

vtraag commented 2 years ago

"weighted density" (i.e. sum(weights)/(n(n - 1)/2) with n the number of nodes)

vtraag commented 2 years ago

I've improved the documentation in https://github.com/igraph/rigraph/commit/49f8b1714c97f3519a358763b6039866a0fb1fb4. I haven't yet changed the default settings, so we should still leave this issue open.

ntamas commented 2 years ago

Moving this to a future release as the documentation was improved, and changing the default settings would be a potentially breaking change and I'd rather not re-run ~900 revdepchecks now :)

clpippel commented 2 years ago

For a small unweighted graphs with degree outliers, like Zachary, an alternative is:

median(degree(g)) / (vcount(g) - 1) or quantile(degree(g))[2] / (vcount(g) - 1)

Note: strength(g) works for both weighted and unweighted graphs.

szhorvat commented 9 months ago

@vtraag Should the default objective function be changed to modularity for 2.0?

I've been seeing repeated complaints on GitHub / StackOverflow that "Leiden doesn't work as well as Louvain / doesn't give similar results / only gives one community". This is likely due to the naive expectation that cluster_leiden should maximize modularity by default. I know that you have good arguments in favour in CPM, but these go way over the head of the typical biologist ...

clpippel commented 9 months ago

@ szhorvat

In my humble opinion as a casual user: probably not.

The problem is that the default resolution parameter is set to 1 as a default value. This gives almost certainly an undesirable partition when the objective function is CPM.

g <- make_graph('zachary'); plot(cluster_leiden(g), g)

My suggestion. If the objective is "CPM" then set the resolution parameter to either:

The latter offers the user an obvious way to try different granularities.

Other suggestion. Copy the input parameters "objective_function" and "resolution_parameter" to the output object, for reason of reproducibility.