scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 599 forks source link

Unexpected behavior of `sc.pp.neighbors` #2587

Closed almaan closed 1 year ago

almaan commented 1 year ago

Please make sure these conditions are met

What happened?

Hej!

Thanks for maintaining such a great package! This issue relates another issue posted (by me) in the squidpy repo, but I think it might be worth bringing up here as well.

The issue in question is how the sc.pp.neighbors function returns an inconsistent number of neighbors even when knn=True. In the documentation of sc.pp.neighbors it's stated that :

n_neighbors: The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. If knn is True, number of nearest neighbors to be searched. If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors neighbor.

and

knn: If True, use a hard threshold to restrict the number of neighbors to n_neighbors, that is, consider a knn graph. Otherwise, use a Gaussian Kernel to assign low weights to neighbors more distant than the n_neighbors nearest neighbor.

as well as

Connectivities: Weighted adjacency matrix of the neighborhood graph of data points. Weights should be interpreted as connectivities.

Hence I would expect that the number of non-zero elements in adata.obsp['connectivities'] in an object to which sc.pp.neighbors(adata, n_neighbors = k, knn = True) have been applied, would sum to k for each row. However, when inspecting these results, it is not true. The number of non-zero elements in a row varies between both higher as well as lower values than the specified n_neighbors (obviously, sometimes it's also the expected n_neighbors value).

Perhaps I'm misunderstanding something, but this behavior is somewhat counterintuitive to me and not what I expect; happy to be corrected though!

/Alma

Minimal code sample

# Import packages

import scanpy as sc
import anndata as ad
import numpy as np

# set random seed
np.random.seed(42)

# create dummy data

adata = ad.AnnData(shape=(1000,1))
adata.obsm['rep'] = np.random.random(size = (1000,2))

# get spatial connectivities
k = 10
sc.pp.neighbors(adata, n_neighbors=k, use_rep = 'rep', knn = True)

# get and count connectivities for each cell
gr = adata.obsp['connectivities']
nn = (np.array(gr.todense()) > 0).sum(axis=1).flatten()

# check if neighbors are equal to k
np.testing.assert_equal(nn,k)

Error output

No response

Versions

``` ----- anndata 0.9.1 scanpy 1.9.3 ----- PIL 9.4.0 anyio NA arrow 1.2.3 asciitree NA asttokens NA attr 23.1.0 babel 2.12.1 backcall 0.2.0 brotli NA certifi 2022.12.07 cffi 1.15.1 charset_normalizer 2.0.4 cloudpickle 2.2.1 comm 0.1.3 cycler 0.10.0 cython_runtime NA dask 2023.5.1 dateutil 2.8.2 debugpy 1.6.7 decorator 5.1.1 defusedxml 0.7.1 entrypoints 0.4 executing 1.2.0 fasteners 0.18 fastjsonschema NA fqdn NA h5py 3.8.0 idna 3.4 igraph 0.10.4 ipykernel 6.23.1 isoduration NA jedi 0.18.2 jinja2 3.1.2 joblib 1.2.0 json5 NA jsonpointer 2.3 jsonschema 4.17.3 jupyter_events 0.6.3 jupyter_server 2.6.0 jupyterlab_server 2.22.1 kiwisolver 1.4.4 leidenalg 0.9.1 llvmlite 0.40.1 markupsafe 2.1.2 matplotlib 3.7.1 mpl_toolkits NA msgpack 1.0.5 natsort 8.3.1 nbformat 5.8.0 numba 0.57.1 numcodecs 0.11.0 numpy 1.23.4 overrides NA packaging 23.1 pandas 2.0.1 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 3.5.1 prometheus_client NA prompt_toolkit 3.0.38 psutil 5.9.5 ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 12.0.1 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.15.1 pynndescent 0.5.10 pyparsing 3.0.9 pyrsistent NA pythonjsonlogger NA pytz 2023.3 requests 2.28.1 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 scipy 1.10.1 send2trash NA session_info 1.0.0 setuptools 66.0.0 six 1.16.0 sklearn 1.2.2 sniffio 1.3.0 socks 1.7.1 stack_data 0.6.2 tblib 2.0.0 texttable 1.6.7 threadpoolctl 3.1.0 tlz 0.12.0 toolz 0.12.0 torch 1.13.1 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 typing_extensions NA umap 0.5.3 uri_template NA urllib3 1.26.15 wcwidth 0.2.6 webcolors 1.13 websocket 1.5.2 yaml 6.0 zarr 2.14.2 zipp NA zmq 25.1.0 zoneinfo NA ----- IPython 8.13.2 jupyter_client 8.2.0 jupyter_core 5.3.0 jupyterlab 4.0.1 ----- Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] Linux-3.10.0-1160.66.1.el7.x86_64-x86_64-with-glibc2.17 ----- Session information updated at 2023-08-02 14:21 ```
eroell commented 1 year ago

Hi Alma,

thanks for raising your thoughts here!

I’ll try to clarify the output a bit and tag @ivirshup here.

sc.pp.neighbors produces two main results, which it indeed stores in the ad.obsp:

  1. A distance matrix in adata.obsp['distances']. This matrix has shape (n_obs, n_obs): for each observation, only n_neighbors-1entries will be non-zero. The nearest neighbor of an observation, itself with distance 0, is discarded, hence the -1. It is probably what you have been thinking of in your description.

  2. A connectivity graph in adata.obsp['connectivity']. This graph has shape (n_obs, flexible), where the flexible number of connections for each observation are determined during the UMAP algorithm.

Hence if you’re interested in the distance matrix, adata.obsp['distances'] would be what you’re looking for! Coming back to your code example, here the test should be a pass:

# Import packages

import scanpy as sc
import anndata as ad
import numpy as np

# set random seed
np.random.seed(42)

# create dummy data
adata = ad.AnnData(shape=(1000,1))
adata.obsm['rep'] = np.random.random(size = (1000,2))

# get spatial connectivities
k = 10
sc.pp.neighbors(adata, n_neighbors=k, use_rep = 'rep', knn = True)

# get and count connectivities for each cell
gr = adata.obsp['distances']
nn = (np.array(gr.todense()) > 0).sum(axis=1).flatten()

# check if neighbors are equal to k-1
np.testing.assert_equal(nn, k-1)

Might actually try to clarify this in documentation, small PR addressing this will follow soon.

How does that sound to you? Please persist if you think I miss the point!

That being said, I think that the computation of the distance matrix and the connectivity graph are both correct.

eroell commented 1 year ago

At the moment I believe this might be more of a documentation issue rather than a bug so I changed the label - if you'd have a follow-up or related issue kindly let me know :)

eroell commented 1 year ago

We will close the issue for now, hopefully this has been addressed helpfully :)

However, please don't hesitate to reopen this issue or create a new one if you have any more questions or run into any related problems in the future.

Thanks for being a part of our community! :)

almaan commented 1 year ago

Hej,

first of all, sorry for a slow reply. Thank you @eroell for the explanation, that fully makes sense - but as you're saying, perhaps a clarification in the documentation would make sense. I wasn't the only one in my team confused by this.

Thank you for maintaining this package and all the great work you guys are doing!

eroell commented 1 year ago

Hi,

thanks a lot for the follow-up here! Happy to hear the explanation made sense - Hopefully the change in the doc adds some clarification from the start.

Thanks for being a part of this community!