Closed alexandregrimaldi closed 1 year ago
Hi, thanks for your interest in scanpy!
I’ll try to comment on your observations here with your code example:
import scanpy as sc
import numpy as np

### Loading and preprocessing data
adata = sc.datasets.pbmc3k_processed()

### Defining scale function
def mean_var(X, axis=0):
    mean = np.mean(X, axis=axis, dtype=np.float64)
    mean_sq = np.multiply(X, X).mean(axis=axis, dtype=np.float64)
    var = mean_sq - mean**2
    # enforce R convention (unbiased estimator) for variance
    var *= X.shape[axis] / (X.shape[axis] - 1)
    return mean, var
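As a quick sanity check on this helper: it computes the variance through the identity Var(X) = E[X²] − E[X]² and then applies the n/(n−1) correction, so it should agree with NumPy's own `np.var(..., ddof=1)`. A minimal sketch on random data (the toy matrix `X_toy` and the seed are made up for illustration and are not part of the original snippet):

```python
import numpy as np

def mean_var(X, axis=0):
    # Same helper as in the snippet: mean and unbiased variance per column
    mean = np.mean(X, axis=axis, dtype=np.float64)
    mean_sq = np.multiply(X, X).mean(axis=axis, dtype=np.float64)
    var = mean_sq - mean**2
    var *= X.shape[axis] / (X.shape[axis] - 1)
    return mean, var

# Hypothetical toy data, 200 "cells" by 5 "genes"
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))

mean, var = mean_var(X_toy, axis=0)

# Compare against NumPy's built-in estimators
matches_mean = np.allclose(mean, X_toy.mean(axis=0))
matches_var = np.allclose(var, X_toy.var(axis=0, ddof=1))
print(matches_mean, matches_var)
```

Note that `ddof=1` is what corresponds to the unbiased (R-style) estimator; the default `ddof=0` would give slightly smaller values.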
As a first note of caution: your function actually modifies the original data matrix of the scanpy object in place, and that matrix is used again later in the snippet.
→ We should create a copy of X. Otherwise the code overwrites this object and ends up comparing the object with itself, just under two different names (this caused your == comparisons to evaluate as True, but it is not what you intended to test).
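To illustrate the aliasing problem in isolation, here is a minimal sketch using plain NumPy. The helper names `scale_in_place` and `scale_copy` and the toy array are hypothetical and only serve this example:

```python
import numpy as np

def scale_in_place(X):
    # Hypothetical helper: `-=` and `/=` write into the caller's array
    X -= X.mean(axis=0)
    X /= X.std(axis=0, ddof=1)
    return X

def scale_copy(X):
    # Hypothetical helper: work on a copy, leave the caller's array alone
    Y = X.copy()
    Y -= Y.mean(axis=0)
    Y /= Y.std(axis=0, ddof=1)
    return Y

original = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])

a = original.copy()
out_a = scale_in_place(a)
same_object = out_a is a                      # result and input are one object
input_changed = not np.array_equal(a, original)

b = original.copy()
out_b = scale_copy(b)
input_preserved = np.array_equal(b, original)

print(same_object, input_changed, input_preserved)
```

With the in-place version, comparing the "input" with the "output" compares an array with itself, so any equality check passes trivially.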
def my_scale_function(X, clip=False):
    # need to make a copy of X
    Y = X.copy()
    mean, var = mean_var(Y, axis=0)
    Y -= mean
    std = np.sqrt(var)
    #std[std == 0] = 1
    Y /= std
    if clip:
        # clip the scaled copy, not the untouched input
        Y = np.clip(Y, -10, 10)
    return np.matrix(Y)
As a second note of caution, floating point numbers should not be compared with the == operator (see for example here).
→ A more common way is to use e.g. np.allclose() for this purpose.
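A small sketch of why this matters, using values unrelated to the dataset (the arrays `a` and `b` are made up for illustration):

```python
import numpy as np

# Two ways of arriving at "0.3" that differ in the last bits
a = np.array([0.1 + 0.2, 0.1 * 3])
b = np.array([0.3, 0.3])

exact = a == b              # element-wise exact comparison
close = np.allclose(a, b)   # tolerance-based comparison

print(exact.all(), close)
```

`np.allclose` uses a relative tolerance `rtol=1e-05` and an absolute tolerance `atol=1e-08` by default; both can be tightened if a stricter check is wanted.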
### Scanpy scale vs my_scale_function.
print("Rescaled with my_scale_function:")
mtx_rescaled = my_scale_function(adata.X)
print("Do a numpy check for closeness of floats:")
print(np.allclose(adata.X, mtx_rescaled))
Do a numpy check for closeness of floats:
False
You can see that this test actually fails. This is because not all genes in the stored matrix are scaled to unit variance, and your function now really does rescale them.
adata.X.var(0)
array([0.9996213 , 0.97964925, 0.29805112, ..., 0.78701097, 0.9980862 ,
0.9996219 ], dtype=float32)
This could happen if, for example, the scaling was originally computed using cells that were later discarded in quality control. So when calling my_scale_function or sc.pp.scale, we expect the cell-by-gene matrix to change at first:
mtx_rescaled_sc = sc.pp.scale(adata.X, copy=True)
print("Do a numpy check for closeness of floats:")
print(np.allclose(adata.X, mtx_rescaled_sc))
Do a numpy check for closeness of floats:
False
But not anymore if we call sc.pp.scale again:
mtx_rescaled_sc_II = sc.pp.scale(mtx_rescaled_sc, copy=True)
print("Do a numpy check for closeness of floats:")
print(np.allclose(mtx_rescaled_sc, mtx_rescaled_sc_II))
Do a numpy check for closeness of floats:
True
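The same pattern can be reproduced without scanpy. The sketch below uses a plain NumPy z-scoring function as a stand-in (it is not scanpy's implementation), and the data and the filtering rule are invented for illustration: scale all cells, drop some of them, and the remaining matrix is no longer exactly scaled, so one more scaling pass changes it while a second pass does not.

```python
import numpy as np

def scale(X):
    # Stand-in for z-scoring: zero mean, unit sample variance per column
    Y = X - X.mean(axis=0)
    return Y / Y.std(axis=0, ddof=1)

# Hypothetical data: 300 "cells" by 4 "genes"
rng = np.random.default_rng(0)
counts = rng.normal(loc=5.0, scale=2.0, size=(300, 4))

scaled_all = scale(counts)

# Pretend quality control later removed cells with low values in gene 0
kept = scaled_all[scaled_all[:, 0] > -0.5]

once = scale(kept)    # first rescaling: the matrix changes
twice = scale(once)   # second rescaling: nothing left to change

changed_first = not np.allclose(kept, once)
stable_second = np.allclose(once, twice)
print(changed_first, stable_second)
```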
This is the behaviour we would expect. I also think that the UMAPs generated should be reproducible. Hope this helps!
Thank you again for bringing up this issue!
Based on the provided information and the discussion so far, it seems that the question has been addressed.
However, please don't hesitate to reopen this issue or create a new one if you have any more questions or run into any related problems in the future.
Thanks for being a part of our community! :)
What happened?
Hi! I am not sure if this is a bug... Every time I rescale the pbmc3k_processed matrix by passing it to the scanpy sc.pp.scale function, I get a very slightly different matrix as output, enough to generate a different UMAP with each run. But if I rewrite it using numpy in a simple function called my_scale_function, it outputs the exact same matrix as the input, generating the same UMAP down the line... Could someone explain to me what is happening? (Note: The matrix is not sparse)