Interaction matrix intended results?

ivirshup commented 3 years ago

Description

I have a few questions about the interaction_matrix function. First, am I using the same definition as you?

To me, interaction matrix M for graph G and labelling c, M[i, j] should be the count of edges between the i nodes and the j nodes.

If this is the case, shouldn't M.sum() be equal to the edge count of the graph?

import squidpy as sq
from scipy import sparse

adata = sq.datasets.visium_fluo_adata()
sq.gr.spatial_neighbors(adata)
sq.gr.interaction_matrix(adata, cluster_key="cluster")
M = adata.uns["cluster_interactions"]

M.sum()
# 9828.0

adata.obsp["spatial_connectivity"].nnz
# 16328

I think this has something to do with treating the graph as undirected, since you get the same result if you only provide the upper triangular matrix:

adata.obsp["upper_spatial_connectivite"] = sparse.triu("spatial_connectivity")

sq.gr.interaction_matrix(adata, "cluster", "upper_spatial_connectivite", copy=True).sum()
# 9828.0

But these graphs aren't always going to be symmetric, so should this be the case?

Side note, wouldn't it be nice to be able to use the weights of the edges?

I've also got a little implementation that has these properties, and is a bit faster. Would be happy to make the PR.

Impl

```python import pandas as pd import numpy as np from numba import njit from scipy import sparse def interaction_matrix(g: sparse.spmatrix, labels, *, dtype=None, weights=False) -> pd.DataFrame: labels = pd.Series(labels).astype("category", copy=False) g = sparse.csr_matrix(g, copy=False) if weights: g_data = g.data else: g_data = np.ones(len(g.data), dtype=bool) if dtype is None: if pd.api.types.is_bool_dtype(g.dtype) or pd.api.types.is_integer_dtype(g.dtype): dtype = np.intp else: dtype = np.float64 n_cats = len(labels.cat.categories) output = np.zeros((n_cats, n_cats), dtype=dtype) return pd.DataFrame( _interaction_matrix(g_data, g.indices, g.indptr, np.asarray(labels.cat.codes), output=output), index=labels.cat.categories, columns=labels.cat.categories, ) @njit def _interaction_matrix(data, indices, indptr, cats, output): indices_list = np.split(indices, indptr[1:-1]) data_list = np.split(data, indptr[1:-1]) for i in range(len(data_list)): cur_row = cats[i] cur_indices = indices_list[i] cur_data = data_list[i] for j, val in zip(cur_indices, cur_data): cur_col = cats[j] output[cur_row, cur_col] += val return output # Also good, a bit slower def interaction_matrix(g, labels, *, weights=False): from sklearn.preprocessing import LabelBinarizer binarizer = LabelBinarizer(sparse_output=True) L = binarizer.fit_transform(labels) if not weights: g = g.astype(bool) return pd.DataFrame( L.T @ g @ L, index=binarizer.classes_, columns=binarizer.classes_, ) ```

Version

1.0.0

michalk8 commented 3 years ago

I think this has something to do with treating the graph as undirected, since you get the same result if you only provide the upper triangular matrix:

You're right, it's using undirected graph under the hood.

pinging @giovp regarding whether this is was done by design or by mistake. Also, I am in support of the weights, but maybe make it optional for backwards compatibility.

giovp commented 3 years ago

I think it's true and I think this is a bug, just checked again and the problem lies here: https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_scipy_sparse_matrix.html

we should explicitly set nx.from_scipy_sparse_matrix(A, create_using=nx.DiGraph) otherwise by default it would use default=nx.Graph. Great catch @ivirshup !

This should also be changed here: https://github.com/theislab/squidpy/blob/b3b5f8a4784250c6792137853092a3d5c9dbf1c7/squidpy/gr/_nhood.py#L245

Also agree on the weights, and would also prefer your numba implementations over wrapping networkx! Would be great if you can open a PR @ivirshup , thank you!

michalk8 commented 3 years ago

closed via #302

scverse / squidpy

Interaction matrix intended results? #301

Description

Version