scverse / rapids_singlecell

Rapids_singlecell: A GPU-accelerated tool for scRNA analysis. Offers seamless scverse compatibility for efficient single-cell data processing and analysis.
https://rapids-singlecell.readthedocs.io/
MIT License

[neighbors] sparse matrix to ind/dist matrix pair: same as PyNNDescent? #80

Open flying-sheep opened 9 months ago

flying-sheep commented 9 months ago

We’ve been thinking about checking whether Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently: its sparse matrix ends up with (n_neighbors - 1) * n_obs non-zero entries instead of n_neighbors * n_obs.

Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.

https://github.com/scverse/rapids_singlecell/blob/b038bcc923b4dd3d4874bae681b0f2cb919bccbf/src/rapids_singlecell/preprocessing/_neighbors.py#L131-L132

https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81
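For anyone reading along, here is roughly the kind of shortcut being discussed. This is a minimal sketch, not rsc’s or Scanpy’s actual code; the function name and checks are mine:

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_to_ind_dist(dist: csr_matrix, n_neighbors: int):
    """Turn a kNN distance matrix (CSR) into (indices, distances) arrays.

    Illustrative sketch only: the reshape shortcut assumes every row
    stores exactly ``n_neighbors`` non-zero entries, which is exactly
    the assumption that breaks when a backend stores
    (n_neighbors - 1) entries per row instead.
    """
    n_obs = dist.shape[0]
    entries_per_row = np.diff(dist.indptr)
    if not np.all(entries_per_row == n_neighbors):
        raise ValueError(
            "rows store a varying/unexpected number of neighbors; "
            "the reshape shortcut does not apply"
        )
    # Note: within a CSR row, entries are ordered by column index,
    # not by distance, so real code may still need a per-row sort.
    knn_indices = dist.indices.reshape(n_obs, n_neighbors)
    knn_dist = dist.data.reshape(n_obs, n_neighbors)
    return knn_indices, knn_dist
```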

Intron7 commented 9 months ago

This is not a bug but intended: cuML behaves like scikit-learn. When I started working on neighbors I also went for (n_neighbors - 1) * n_obs, but for rsc, n_neighbors * n_obs is important for the UMAP shortcut. I can think about adding the zero-distance self entries in UMAP and mimicking PyNNDescent.
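In case it helps, a minimal sketch of what “adding the zero-distance self entries” could look like (names and shapes are my assumptions, not an rsc API); the output grows to k + 1 columns, and whether to then drop the last column to stay at n_neighbors is a separate choice:

```python
import numpy as np


def add_self_neighbor(knn_indices: np.ndarray, knn_dist: np.ndarray):
    """Prepend each cell as its own zero-distance neighbor.

    Hypothetical sketch: cuML/scikit-learn-style arrays of shape
    (n_obs, k) without the self neighbor become PyNNDescent-style
    arrays of shape (n_obs, k + 1) whose first column is the cell
    itself at distance 0.
    """
    n_obs = knn_indices.shape[0]
    self_idx = np.arange(n_obs)[:, None]
    indices = np.hstack([self_idx, knn_indices])
    dist = np.hstack([np.zeros((n_obs, 1), dtype=knn_dist.dtype), knn_dist])
    return indices, dist
```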

flying-sheep commented 9 months ago

No need, I just wanted to start the discussion about what is “correct”.

Seems like this isn’t so easy to answer; see the relevant note from the linked document:

Note

When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of n_neighbors is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference between fit(X).transform(X) and fit_transform(X), which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed when mode == 'distance'. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.

I assume “one extra neighbor” means “per cell, calculate n_neighbors neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors non-zero entries per row. Our code then creates knn_* matrices with n_neighbors columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.

So if PyNNDescent has n_neighbors - 1 entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1 columns each.
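To make the two conventions concrete, here is a toy illustration (made-up data, not either library’s actual output), assuming that zero-distance self entries are not kept as stored elements of the sparse matrix, which would explain the (n_neighbors - 1) * n_obs count above:

```python
import numpy as np
from scipy.sparse import csr_matrix

n_obs, n_neighbors = 4, 3

# Convention A (scikit-learn / cuML in this thread): n_neighbors
# non-self neighbors per cell -> n_neighbors stored entries per row.
rows = np.repeat(np.arange(n_obs), n_neighbors)
cols_a = np.array([1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2])
g_a = csr_matrix((np.full(rows.shape, 0.5), (rows, cols_a)), shape=(n_obs, n_obs))
assert g_a.nnz == n_neighbors * n_obs  # 12

# Convention B (PyNNDescent in this thread): the cell itself counts as
# one of the n_neighbors, at distance 0. Once zero entries are dropped,
# only n_neighbors - 1 stored entries per row remain, so a reshape-based
# conversion yields matrices with n_neighbors - 1 columns.
cols_b = np.array([0, 1, 2, 1, 2, 3, 2, 3, 0, 3, 0, 1])  # first per row = self
dists_b = np.where(cols_b == rows, 0.0, 0.5)
g_b = csr_matrix((dists_b, (rows, cols_b)), shape=(n_obs, n_obs))
g_b.eliminate_zeros()
assert g_b.nnz == (n_neighbors - 1) * n_obs  # 8
```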

cc @eroell @ivirshup