
pp.neighbors returns edgeless graph for n_neighbors=1 #1706

Open fbnrst opened 3 years ago

fbnrst commented 3 years ago

I worked on a very small bulk dataset recently (n_obs = 15) and wanted to try a 1-nearest-neighbor graph for the UMAP. However, an edgeless graph is returned for n_neighbors=1; I would expect each node to have at least one edge in this case.

Minimal code sample (that we can copy&paste without having any data)

import scanpy as sc
adata = sc.datasets.blobs(n_observations=5)
sc.pp.neighbors(adata, n_neighbors=1)
print('Connectivities:\n', adata.uns['neighbors']['connectivities'].A)
sc.tl.umap(adata)

Output:

Connectivities:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: FutureWarning: This location for 'connectivities' is deprecated. It has been moved to .obsp[connectivities], and will not be accesible here in a future version of anndata.
  after removing the cwd from sys.path.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-2906b54049c5> in <module>
      3 sc.pp.neighbors(adata, n_neighbors=1)
      4 print('Connectivities:\n', adata.uns['neighbors']['connectivities'].A)
----> 5 sc.tl.umap(adata)

/opt/conda/lib/python3.7/site-packages/scanpy/tools/_umap.py in umap(adata, min_dist, spread, n_components, maxiter, alpha, gamma, negative_sample_rate, init_pos, random_state, a, b, copy, method, neighbors_key)
    194             neigh_params.get('metric', 'euclidean'),
    195             neigh_params.get('metric_kwds', {}),
--> 196             verbose=settings.verbosity > 3,
    197         )
    198     elif method == 'rapids':

/opt/conda/lib/python3.7/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, output_metric, output_metric_kwds, euclidean_output, parallel, verbose)
   1022             n_epochs = 200
   1023 
-> 1024     graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
   1025     graph.eliminate_zeros()
   1026 

/opt/conda/lib/python3.7/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial, where)
     37 def _amax(a, axis=None, out=None, keepdims=False,
     38           initial=_NoValue, where=True):
---> 39     return umr_maximum(a, axis, None, out, keepdims, initial, where)
     40 
     41 def _amin(a, axis=None, out=None, keepdims=False,

ValueError: zero-size array to reduction operation maximum which has no identity
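
My reading of where the ValueError comes from: since all connectivities are zero, the sparse graph handed to UMAP has an empty .data array, and calling .max() on an empty array raises exactly this error. A tiny sketch of just that step:

import numpy as np
from scipy import sparse

graph = sparse.csr_matrix((5, 5))  # all-zero connectivities, no stored entries
print(graph.data.size)             # 0
graph.data.max()                   # ValueError: zero-size array to reduction operation maximum ...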

Versions

For me, scanpy.logging.print_versions() crashes due to the importlib_metadata issue.

scanpy==1.7.1 anndata==0.7.5 umap==0.4.6 numpy==1.20.0 scipy==1.6.0 pandas==1.2.1 scikit-learn==0.24.1 statsmodels==0.12.2 python-igraph==0.8.3 louvain==0.7.0 leidenalg==0.8.3
ivirshup commented 3 years ago

I think this has to do with us relying on UMAP. You can check this yourself in UMAP, but you'll actually end up with n-1 neighbors per node. I believe this has to do with each point being its own nearest neighbor, but I forget whether that's important for the nearest-neighbor-descent algorithm (preventing a node from adding itself by already having it in the heap) or for UMAP (simplexes??). If I can find a link to where I read this, I'll share it here.
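
A quick way to see the n-1 behaviour from Scanpy's side (a sketch using the blobs toy data from above; I'm reading the distance matrix from the newer .obsp location):

import numpy as np
import scanpy as sc

adata = sc.datasets.blobs(n_observations=5)
for n_neighbors in [2, 3, 4]:
    sc.pp.neighbors(adata, n_neighbors=n_neighbors)
    # each row of the directed kNN graph stores n_neighbors - 1 distances,
    # since the point itself is counted as one of its own nearest neighbors
    out_degree = np.asarray((adata.obsp['distances'] != 0).sum(axis=1)).ravel()
    print(n_neighbors, out_degree)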

Two considerations:

I was definitely surprised when I read about this recently, and would be open to changing the behavior.

It would affect reproducibility for everyone, though.

fbnrst commented 3 years ago

It seems you do not always end up with n-1 neighbors, because for n_neighbors=3 you suddenly get differing numbers of neighbors per node:

import scanpy as sc
adata = sc.datasets.blobs(n_observations=5)

for n_neighbors in [1, 2, 3]:
    sc.pp.neighbors(adata, n_neighbors=n_neighbors)
    print(f'n_neighbors = {n_neighbors}:\n', adata.uns['neighbors']['connectivities'].A)

Output:

n_neighbors = 1:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
n_neighbors = 2:
 [[0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]]
n_neighbors = 3:
 [[0.         0.5849553  0.         1.         0.5849636 ]
 [0.5849553  0.         1.         0.5849678  0.        ]
 [0.         1.         0.         0.58496827 0.        ]
 [1.         0.5849678  0.58496827 0.         1.        ]
 [0.5849636  0.         0.         1.         0.        ]]

It has been a while since I read about UMAP, and I can't get my head around why this happens right now. Relying on UMAP seems like a good idea to me; maybe the corner case n_neighbors=1 should just be caught with a more meaningful error message?
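
A minimal sketch of the kind of guard I have in mind (hypothetical; check_n_neighbors is not an existing scanpy function, and the real check would live inside sc.pp.neighbors):

def check_n_neighbors(n_neighbors: int) -> None:
    # with n_neighbors=1 the only neighbor found is the point itself, so the
    # resulting graph has no real edges and UMAP later fails on an empty
    # edge array; fail early with an actionable message instead
    if n_neighbors < 2:
        raise ValueError(
            f"n_neighbors={n_neighbors} yields an edgeless graph "
            "(each point is only its own neighbor); use n_neighbors >= 2."
        )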

ivirshup commented 3 years ago

It seems you do not always end up with n-1 neighbors, because for n_neighbors=3 you suddenly get differing numbers of neighbors per node:

I think the varying number of neighbors is because the connectivity graph is made symmetric.

import scanpy as sc, numpy as np
pbmc = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(pbmc)

np.unique((pbmc.obsp["distances"] != 0).sum(axis=1).flat)
# array([14])
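
(The 14 above fits the n-1 picture, since the default n_neighbors is 15.) On the blobs example, the effect of the symmetrization is easy to see; a sketch, again using the .obsp location:

import numpy as np
import scanpy as sc

adata = sc.datasets.blobs(n_observations=5)
sc.pp.neighbors(adata, n_neighbors=3)

# the directed kNN graph has a constant out-degree of n_neighbors - 1 ...
print(np.asarray((adata.obsp['distances'] != 0).sum(axis=1)).ravel())
# ... while the connectivities keep an edge if it is present in either
# direction, so the per-node degree can vary after symmetrization
print(np.asarray((adata.obsp['connectivities'] != 0).sum(axis=1)).ravel())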

Yeah, n_neighbors=1 should definitely throw an error (I think it does for UMAP). We do document that reasonable values start at 2, but it could also be good to have more reasoning on that.
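
For reference, a quick check of the UMAP side (a sketch; I'm assuming umap-learn validates n_neighbors at fit time):

import numpy as np
import umap

X = np.random.default_rng(0).normal(size=(15, 5))
try:
    umap.UMAP(n_neighbors=1).fit(X)
except ValueError as e:
    # expected if umap-learn enforces n_neighbors >= 2
    print('UMAP rejected n_neighbors=1:', e)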