scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Adding Moran's I calculation to Scanpy #1698

Closed Hrovatin closed 3 years ago

Hrovatin commented 3 years ago

Could you add Moran's I calculation to Scanpy? It could be used in scIB and also to find variable genes across an embedding (it could be an alternative to SEMITONES, which takes a while to compute).
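For reference, Moran's I for a feature $x$ measured over a graph with weights $w_{ij}$ is usually defined as

$$I = \frac{N}{W}\,\frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}, \qquad W = \sum_i \sum_j w_{ij},$$

where $N$ is the number of cells; values near 1 indicate strong positive autocorrelation of the feature over the graph (e.g. a kNN graph over an embedding, or a spatial grid).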

giovp commented 3 years ago

Hi @Hrovatin, so was it useful for your task? Curious to hear. A couple of questions:

Hrovatin commented 3 years ago

Yes, it works great for me. I can compare scores obtained on individual samples and on integrated data to show that genes that are spatially variable in the samples remain so on the integrated data. However, with the default n_iters most p-values were 0 (for my known marker set), and calculating more iterations would take too long, so I might just use the I score where possible. I would like to add Moran's I as a bio-conservation integration metric to scIB; for me this is the only metric that does not require cell subtype annotation (which is a cumbersome and unreliable procedure), and it performs similarly to the current scIB metrics. However, scIB has only scanpy as a dependency, not squidpy. It seems a bit of an overkill to add a package dependency to scIB for a single function.

SEMITONES (https://www.biorxiv.org/content/10.1101/2020.11.17.386664v1) is a package for finding genes that vary linearly across an embedding, and I think Moran's I would give me similar genes (I must try it out). Moran's I might even be better for the task, as well as quicker and less complicated. This is another reason why it would be neat to have Moran's I directly in scanpy: you may not have spatial data, so you don't really need squidpy, but finding gene patterns can still be useful when you have continuous effects without trajectories. This is what my main beta cell subtype analysis is currently based on.

ivirshup commented 3 years ago

I think Moran's I would be a good addition to scanpy; it fits well with Geary's C (https://github.com/theislab/scanpy/pull/915).

giovp commented 3 years ago

I see, now I understand the usage as a metric better, thank you! I'd suggest that this is more of a feature request for Scanpy then, @Hrovatin. Feel free to open one or contribute to theislab/scanpy#915!

ivirshup commented 3 years ago

Transferred to scanpy.

giovp commented 3 years ago

👀 Crazy, I didn't know that was possible!

LuckyMD commented 3 years ago

This would be a good fit for the potential sc.metrics module discussed recently (I can't find the issue anymore).

ivirshup commented 3 years ago

@LuckyMD I think we talked about it in the PR linked above.

So, what needs to happen here? Is it just copying and pasting some code from squidpy, or is there more implementation work to do?

ivirshup commented 3 years ago

Looks like `gearys_c` and `morans_i` will be out in 1.8 (`morans_i` should be on master pretty soon).

@Hrovatin, do you have any thoughts on the usefulness of calculating a p-value vs. just calculating I? How could we tell whether the statistic is being influenced by the structure of the graph, or by the non-spatial distribution of the values? Also, maybe there's a faster way to calculate it than with a fixed number of permutations, especially for structured grids like we have with Visium data?
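For context, here is a minimal sketch of the fixed-iteration permutation scheme under discussion, written against the new `sc.metrics.morans_i` rather than squidpy's actual implementation (the pseudocount keeps the empirical p-value from ever being exactly 0):

```python
import numpy as np
import scanpy as sc

def permutation_pvalue(graph, x, n_perms=1000, seed=0):
    # Shuffle the values over the cells and recompute Moran's I each
    # time to build an empirical null distribution.
    rng = np.random.default_rng(seed)
    observed = sc.metrics.morans_i(graph, x)
    null = np.array([
        sc.metrics.morans_i(graph, rng.permutation(x)) for _ in range(n_perms)
    ])
    # One-sided empirical p-value with a pseudocount, so it can never be
    # exactly 0 regardless of how few permutations are run.
    return (np.sum(null >= observed) + 1) / (n_perms + 1)
```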

giovp commented 3 years ago

Sorry, late reply on the p-values @ivirshup. As for your question in #1740, I don't think p-values are needed in this case, same as for Geary's C. I'd assume the usage of Moran's I is the same as Geary's C: it just provides an effect size for whatever effect you are investigating. In the literature I found that an alternative to permutation-based p-values is an analytic one (assuming normality), but they suggest "row-normalizing" the weights (i.e. the kNN graph), as sketched below.
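As an illustration, a minimal sketch of that row-normalization with scikit-learn, assuming an AnnData object `adata` holding the kNN graph in `.obsp["connectivities"]`:

```python
from sklearn.preprocessing import normalize

# Row-standardize the weights: each row of the kNN graph sums to 1.
# normalize() accepts sparse input and returns a sparse matrix.
W = adata.obsp["connectivities"]
W_row_standardized = normalize(W, norm="l1", axis=1)
```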

Hrovatin commented 3 years ago

For me the p-values with 1000 permutations were not useful, as they were all 0 for the genes of interest, so I could not discriminate between the genes. But I do not know how directly comparable the I scores are; I only used them on a single dataset with different kNN graphs, which I would assume should make them more comparable.

ivirshup commented 3 years ago

I was thinking about these in the context of feature selection, where you may want a principled cutoff for inclusion. From looking at this in one Visium dataset and one single-cell dataset, it looks like the expected values for genes with a high Moran's I were quite low. This was not the case for Geary's C on the UMAP connectivities with single-cell data.

Here are some plots around this. Values from permuting the order are in blue, measured values are in black. This only shows the genes that were in the 95th percentile of scores. I inverted the values of Geary's C so they were easier to compare with Moran's I. The x-axis is the score between 0 and 1; the y-axis is gene rank.

It's pretty clear there is a much greater dispersion of expected values for Geary's C.

Moran's I, UMAP connectivity ![image](https://user-images.githubusercontent.com/8238804/112266866-bac8c600-8cc8-11eb-96bc-922256b7e52e.png)
Geary's C, UMAP connectivity ![image](https://user-images.githubusercontent.com/8238804/112266847-b3092180-8cc8-11eb-8e1a-56b26c6bfe23.png)
Moran's I, spatial connectivity ![image](https://user-images.githubusercontent.com/8238804/112269342-5b6cb500-8ccc-11eb-8339-b0b9512a5081.png)
Geary's C, spatial connectivity ![image](https://user-images.githubusercontent.com/8238804/112266893-c5835b00-8cc8-11eb-931e-0169ccc0471f.png)

Comparing the distribution of scores for the single-cell PBMC data:

[image]

My current thinking is that Geary's C is more sensitive to sparse features, and may be more in need of significance testing. I think this is not as visible for Visium data since the features are less sparse.

Hrovatin commented 3 years ago

I started doing feature (gene) selection on scRNA-seq data as well. I also observe a dependency between Moran's I and gene expression sparsity. I was thinking that maybe I could regress out the expression effect or select genes per bin (see the sketch at the end of this comment).

[image]

And this was the problem with p-values in squidpy's Moran's I (plot for the first 100 genes in my adata, using 100 permutations; this should not really lead to low p-values though, as I think the reported p-values are estimated from the shape of the null distribution, but I'm not 100% sure).

[image]
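A minimal sketch of the per-bin idea mentioned above; `scores` and `mean_expr` are assumed to be pandas Series (Moran's I per gene and mean expression per gene), and the bin count and per-bin quota are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"score": scores, "mean_expr": mean_expr})
# Bin genes by mean expression, then rank by Moran's I within each bin,
# so that expression level / sparsity does not dominate the selection.
df["bin"] = pd.qcut(df["mean_expr"], q=20, duplicates="drop")
top_per_bin = df.groupby("bin", observed=True)["score"].nlargest(50)
```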

giovp commented 3 years ago

> My current thinking is that Geary's C is more sensitive to sparse features, and may be more in need of significance testing. I think this is not as visible for Visium data since the features are less sparse.

Really interesting comparison @ivirshup. I think it makes sense that it is more sensitive to sparsity, because the score is not computed "against a mean" but against the neighborhood graph. In some sense, Moran's I is smoother.

> And this was the problem with p-values in squidpy's Moran's I (plot for the first 100 genes in my adata, using 100 permutations; this should not really lead to low p-values though, as I think the reported p-values are estimated from the shape of the null distribution, but I'm not 100% sure).

Indeed, they are computed against a null distribution built from the permutations.

As a personal opinion, I don't think reporting p-values for these types of statistics is very useful (as in, I personally wouldn't draw conclusions from significance, but rather from effect size, and this holds true for a t-test as well imho...). I should also say that for squidpy we might want to compute p-values for completeness... but again, I would refrain from looking at them. In that case I also think it's fairer, since the graph comes from another source and is not computed from gene expression similarity.

Hrovatin commented 3 years ago

Maybe row-scaling of the connectivities should be an option, to ensure that Moran's I stays within [-1, 1]; otherwise it may take very large values on some data.


```python
import scanpy as sc
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import minmax_scale

# Without scaling the graph weights, I takes extreme values on this data
morans_i = sc.metrics.morans_i(adata_pb_c)
morans_i = pd.Series(morans_i, index=adata_pb_c.var_names)
print(morans_i.describe())
```

```
count     1.412000e+04
mean     3.701341e+277
std                inf
min      -5.246069e+05
25%       1.087433e-10
50%       4.935376e-01
75%       1.005601e+00
max      4.440351e+281
dtype: float64
```

```python
# Row scaling so that I is between -1 and 1
adata_pb_c_temp = adata_pb_c.copy()
adata_pb_c_temp.obsp['connectivities'] = csr_matrix(minmax_scale(
    adata_pb_c_temp.obsp['connectivities'].todense(), axis=1))
morans_i = sc.metrics.morans_i(adata_pb_c_temp)
morans_i = pd.Series(morans_i, index=adata_pb_c.var_names)
del adata_pb_c_temp
print(morans_i.describe())
```

```
count     1.412000e+04
mean      1.737230e-01
std       2.733991e-01
min       0.000000e+00
25%      7.163952e-322
50%      1.462434e-321
75%       4.578002e-01
max       9.320416e-01
dtype: float64
```

Hrovatin commented 3 years ago

Also, is there some randomness? I get different results on different runs.

giovp commented 3 years ago

Hi @Hrovatin, no, there shouldn't be randomness like you are describing. Can you try to post a reprex? Re: normalization, I could add it to both Moran's and Geary's; what do you think @ivirshup?

ivirshup commented 3 years ago

I think it would be generally useful to have some methods for re-weighting graphs (some discussion in #1561), but I'm not sure keyword arguments per function are the right way to do this. Maybe something in a graph module, or in preprocessing?

My understanding (source) was that the way to scale weights for Moran's I is to make sure the edge weights are between 0 and 1. Is this right? This should already be the case for UMAP-based connectivities. Is this the graph you were using, @Hrovatin? I would definitely like to see the data you used here.

Btw, you could use `maxabs_scale` on a sparse matrix to prevent densification.
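A minimal sketch of that suggestion, assuming the graph lives in `adata.obsp["connectivities"]`:

```python
from sklearn.preprocessing import maxabs_scale

# Scale each row by its maximum absolute value; maxabs_scale accepts
# sparse input, so the graph is never densified.
adata.obsp["connectivities"] = maxabs_scale(adata.obsp["connectivities"], axis=1)
```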

Hrovatin commented 3 years ago

There is a sentence about this at the bottom of slide 8 of https://www.uni-kassel.de/fb07/fileadmin/datas/fb07/5-Institute/IVWL/Kosfeld/lehre/spatial/SpatialEconometrics2.pdf, but it is not very clear. In my case, on this specific data, scaling each row from 0 to 1 led to I values in the expected range. But then sometimes I get values in the expected range even without scaling; maybe the extreme values I was getting were due to the variability problem.

Regarding how to reproduce it: I currently have an odd dataset, and if I run my Jupyter cell multiple times I sometimes get different results. It is odd. Sometimes the I values are within the range [-1, 1] and sometimes they explode.

[image]

Sometimes the differences are very extreme:

[image]

ivirshup commented 3 years ago

Could you share the dataset?

Hrovatin commented 3 years ago

Are you on our lab's Mattermost? I cannot find you there.

Hrovatin commented 3 years ago

Is there an email where I can send it? I cannot post it here, @ivirshup.

ivirshup commented 3 years ago

Ha, no, I'm not on the Mattermost. ivirshup@gmail.com

Though if you could cut the dataset down to something smaller and more shareable, that would be helpful as well.

ivirshup commented 3 years ago

I can replicate. This is so weird. Sometimes it returns reasonable results.

Previous investigation:

This data is a bit strange (is it important for `X` to be all negative?). Interestingly, it also breaks `sc.pp.scale`. Any insights you can provide on the data would be helpful. I was seeing this kind of issue before with parallelization, where we were running into a numba bug and getting back an uninitialized array as the result.

Some other things I've been noticing:

* If I calculate Moran's I on the first 100 variables, the 45th variable is the only one whose value changes
* This is especially weird since all values change if I run the function on the full set of variables
* If I use a smaller interval size (10), I don't get any varying results
* If I run Moran's I for each value individually, I also don't get any varying results

I'm now checking: if I take a random subsample of the variables and compute Moran's I for those values together, is it always the same variables that are inconsistent?

This is so weird.

```python
import scanpy as sc
import numpy as np
import pandas as pd
from functools import partial

def check_subset(g, X, idx, tries=5, func=sc.metrics.morans_i):
    # Recompute the statistic several times on the same subset and
    # record which values stay identical across runs
    sub = X[idx]
    result = np.ones(sub.shape[0], dtype=bool)
    first = func(g, sub)
    for i in range(tries):
        result &= (first == func(g, sub))
    return result

morans_i = partial(
    check_subset,
    adata.obsp["connectivities"],
    adata.X.T.copy(),
    func=sc.metrics.morans_i,
)

# Take adata.n_vars samples of 100 variables each
samples = np.random.choice(adata.n_vars, (adata.n_vars, 100))

# This takes a while
results = np.vstack([morans_i(samples[idx]) for idx in range(adata.n_vars)])

df = pd.DataFrame({
    "var_idx": samples.flatten(),
    "consistent": results.flatten(),
    "sample": np.repeat(np.arange(adata.n_vars), 100),
    "order": np.tile(np.arange(100), adata.n_vars),
})
df.groupby("order").mean()["consistent"].plot()
```

![image](https://user-images.githubusercontent.com/8238804/116065566-7e72f600-a6ca-11eb-9ad6-5c991ed79910.png)

Progress!

Minimal reproducer:

```python
import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc68k_reduced()
pbmc = pbmc[:411].copy()
sc.pp.neighbors(pbmc)
res = [sc.metrics.morans_i(pbmc.obsp["connectivities"], pbmc.X.T) for i in range(3)]

# Repeated runs disagree:
np.equal(res[0], res[1])
```

I can trigger the bug by changing the dataset size.

I think this may be a memory alignment issue.

If `X` is a sparse matrix, I don't get any inconsistency in the results (which would make sense for an alignment issue).
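In terms of the reproducer above, a sketch of that observation:

```python
from scipy.sparse import csr_matrix

# Passing the values as a sparse matrix gives consistent results
res_sparse = sc.metrics.morans_i(pbmc.obsp["connectivities"], csr_matrix(pbmc.X.T))
```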

ivirshup commented 3 years ago

I think you have some variables that are the same for all samples. This leads to a division by zero (for a constant feature, the denominator $\sum_i (x_i - \bar{x})^2$ is zero), which numba is not handling gracefully or reporting.

Here's how to find those values:

```python
import numpy as np

# Indices of variables whose value is identical across all observations
np.where((adata.X[[0], :] == adata.X).all(axis=0))
```

I believe if you filter these out, this should work.

I'm not sure if there is a correct value for Moran's I or Geary's C in this case. Should we error?


Numba bug report: https://github.com/numba/numba/issues/6976

Hrovatin commented 3 years ago

Thank you! This data is from Compass (they term these values "consistencies"). The values are negative, but as you already know, the same thing happened when scaling to [0, 1] per feature. I will remove the constant values. I think the result in such a case could just contain NaNs and emit a warning.

p.s.: I really like how you document everything you do so nicely 😃

ivirshup commented 3 years ago

> I think the result in such a case could just contain NaNs and emit a warning.

This sounds reasonable to me.
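A minimal sketch of that behavior; `morans_i_with_nan` is a hypothetical wrapper, not the actual scanpy implementation:

```python
import warnings
import numpy as np
import scanpy as sc

def morans_i_with_nan(graph, X):
    # Constant features get NaN plus a warning, instead of a silent
    # division-by-zero artifact inside the numba kernel.
    X = np.asarray(X)
    constant = np.ptp(X, axis=1) == 0  # features with zero range
    out = np.full(X.shape[0], np.nan)
    if constant.any():
        warnings.warn(f"{int(constant.sum())} constant features returned as NaN")
    if (~constant).any():
        out[~constant] = sc.metrics.morans_i(graph, X[~constant])
    return out
```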

With sparse values it consistently gives results, but the wrong results. The iteration is being chunked (probably related to the number of available cores), and it looks like, within each chunk, all values after the constant one are filled with zeros. I should look into whether this is also numba, or a logic bug. If it's numba, it's strange that it's zeros and not inf or nan. If it's on our end, I'm not sure why the later iterations would be skipped.


> p.s.: I really like how you document everything you do so nicely 😃

Thanks! It's mostly so I can remember my reasoning a month later 😊

ivirshup commented 3 years ago

@Hrovatin, if you haven't read it yet, I think you would find the Hotspot method from the Yosef lab interesting. It uses something similar to Moran's I for feature selection, and local Moran's I for gene module detection. They use parametric null models to get significance for their scores, which would be significantly faster than permutation testing. I'm a little unsure how the parametric null models correspond to the non-parametric ones, since the only comparison I've found so far is some Q-Q plots in the supplement.