Open flying-sheep opened 9 months ago
This is not a bug but intended. cuML behaves like scikit-learn. When I started working on neighbors I also went for (n_neighbors - 1) * n_obs
. But for rsc n_neighbors * n_obs
is important for the umap shortcut. I can think about adding in the 0 zeros in UMAP and mimic PyNNDescent.
No need, I just wanted to start the discussion about what is “correct”.
Seems like this isn’t so easy to answer, see relevant note from that linked document:
Note
When a specific number of neighbors is queried (using KNeighborsTransformer), the definition of
n_neighbors
is ambiguous since it can either include each training point as its own neighbor, or exclude them. Neither choice is perfect, since including them leads to a different number of non-self neighbors during training and testing, while excluding them leads to a difference betweenfit(X).transform(X)
andfit_transform(X)
, which is against scikit-learn API. In KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count ofn_neighbors
. However, for compatibility reasons with other estimators which use the other definition, one extra neighbor will be computed whenmode == 'distance'
. To maximise compatibility with all estimators, a safe choice is to always include one extra neighbor in a custom nearest neighbors estimator, since unnecessary neighbors will be filtered by following estimators.
I assume “one extra neighbor” means “per cell, calculate n_neighbors
neighbors that aren’t the same cell”, resulting in a sparse matrix with n_neighbors
non-zero entries per row. Our code then creates knn_*
matrices with n_neighbors
columns that include that “additional” neighbor, but not the implicit, zero-distance, “self” neighbor.
So if PyNNDescent has n_neighbors - 1
entries per row in the sparse matrix, that means it doesn’t follow the sklearn advice and interprets n_neighbors
as “number of neighbors including the same cell”. So using your code on PyNNDescent’s output returns two matrices with n_neighbors - 1
columns each.
cc @eroell @ivirshup
We’ve been thinking about checking if Scanpy’s neighbors backends can benefit from the shortcut, but it seems like PyNNDescent behaves differently (
(n_neighbors - 1) * n_obs
instead ofn_neighbors * n_obs
)Shouldn’t the two match up in behavior? Is this a cupy bug? I’m unsure where to harmonize this.
https://github.com/scverse/rapids_singlecell/blob/b038bcc923b4dd3d4874bae681b0f2cb919bccbf/src/rapids_singlecell/preprocessing/_neighbors.py#L131-L132
https://github.com/scverse/scanpy/blob/616d580384c1daa8115a270220aafd8c09f9f0f1/scanpy/neighbors/_common.py#L64-L81