Hi @Hrovatin, so was it useful for your task? Curious to hear. Couple of questions:
Yes, it works great for me. I can compare scores obtained on individual samples and on integrated data to show that genes that are spatially variable in the samples remain so on the integrated data. However, with the default `n_iters` most p-values were 0 (for my known marker set), and calculating more iterations would take too long, so I might just use the I score where possible. I would like to add Moran's I as a bio-conservation integration metric to scIB - this is for me the only metric that does not require cell subtype annotation (which is a cumbersome and unreliable procedure), and it performs similarly to the current scIB metrics. However, scIB has only scanpy as a dependency, not squidpy. It seems a bit of an overkill to add a package dependency to scIB for a single function.
SEMITONES (https://www.biorxiv.org/content/10.1101/2020.11.17.386664v1) is a package for finding genes that vary linearly across an embedding, and I think Moran's I would give me similar genes (must try it out) - Moran's I might even be better for the task, and quicker + less complicated. This is another reason why it would be neat to have Moran's I directly in scanpy. You may not have spatial data, so you don't really need squidpy. But finding gene patterns may be useful when you have continuous effects but no trajectories - this is what my main beta-cell subtype analysis is currently based on.
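For reference, the statistic itself is simple to compute from a weighted graph. A minimal sketch of the standard definition (not the squidpy implementation; `w` is assumed to be a sparse kNN graph and `x` one gene's values over cells):

```python
import numpy as np
from scipy.sparse import csr_matrix

def morans_i(w: csr_matrix, x: np.ndarray) -> float:
    """Moran's I = (n / S0) * (z' W z) / (z' z), with z = x - mean(x)."""
    z = x - x.mean()
    s0 = w.sum()          # S0: total edge weight of the graph
    num = z @ (w @ z)     # z' W z: autocorrelation along graph edges
    den = (z * z).sum()   # z' z: total variance of the feature
    return len(x) / s0 * num / den
```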
I think Moran's I would be a good addition to scanpy, it fits well with Geary's C (https://github.com/theislab/scanpy/pull/915)
I see, now I understand the usage as a metric better, thank you! I'd suggest that this is more of a feature request for Scanpy then, @Hrovatin - feel free to open one or contribute to (theislab/scanpy#915)!
Transferred to scanpy
👀 crazy didn't know it was possible!
This would be a good fit for a potential `sc.metrics` module as discussed recently (can't find the issue anymore).
@LuckyMD I think we talked about it in the PR linked above.
So, what needs to happen here? Is it just copying and pasting some code from squidpy, or is there more implementation work to do?
Looks like `gearys_c` and `morans_i` will be out in 1.8 (`morans_i` should be on master pretty soon).
@Hrovatin, do you have any thoughts on the usefulness of calculating a p-value vs. just calculating `I`? How could we tell whether the statistic could be influenced by the structure of the graph, or a non-spatial distribution of the values? Also, maybe there's a faster way to calculate it than with a fixed number of iterations, especially for structured grids like we have with Visium data?
Sorry, late reply on the p-values @ivirshup. As for your question in #1740, I don't think p-values are needed in this case, the same as for Geary's C. I'd assume the usage of Moran's I is the same as Geary's C, and it's just to provide an effect size for whatever effect you are investigating. In the literature I found that an alternative to the permutation-based p-value is an analytic one (assuming normality), but they suggest to "row-normalize" the weights (i.e. the kNN graph).
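Row-normalization (each row of the weight matrix summing to 1, the "W" coding in the spatial statistics literature) could look roughly like this for a sparse kNN graph:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def row_normalize(w: csr_matrix) -> csr_matrix:
    """Scale each row of the weight matrix so it sums to 1."""
    row_sums = np.asarray(w.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # leave isolated nodes untouched
    return diags(1.0 / row_sums) @ w
```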
For me the p-values with 1000 permutations were not useful, as they were all 0 for the genes of interest, so they could not discriminate the genes. But I do not know how directly comparable the I values are - I used them only on a single dataset with different kNN graphs, which I would assume should make them more comparable.
I was thinking about these in the context of feature selection, where you may want a principled cutoff for inclusion. This is from looking at one Visium dataset and one single-cell dataset. It looks like the expected values for genes with a high Moran's I were quite low. This was not the case for Geary's C on the UMAP connectivities with single-cell data.
Here are some plots around this. Values from permuting the order are in blue, measured values are in black. This only shows the genes which were in the 95th percentile of scores. I inverted the values of Geary's C so it was easier to compare with Moran's I. The x-axis is the score between 0 and 1, the y-axis is gene rank.
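(A sketch of how such a permutation null can be generated; `score` stands for any autocorrelation statistic, e.g. the `morans_i` helper above, and `n_perm`/`seed` are illustrative:)

```python
import numpy as np

def permutation_null(score, w, x, n_perm=100, seed=0):
    """Shuffle the values over cells and re-score: this breaks any
    association with the graph while keeping the value distribution."""
    rng = np.random.default_rng(seed)
    return np.array([score(w, rng.permutation(x)) for _ in range(n_perm)])
```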
It's pretty clear there is much greater dispersion of the expected value for Geary's C.
Comparing the distribution of scores for the single-cell PBMC data:
My current thinking is that Geary's C is more sensitive to sparse features, and may be more in need of significance testing. I think this is not as visible for Visium data, since features there are less sparse.
I started doing feature (gene) selection on scSeq data as well. I also observe a dependency between Moran's I and gene-expression sparsity. I was thinking that maybe I could regress out the expression effect, or select genes per bin (see the sketch below). And this was the problem with the p-values in squidpy's Moran's I (plot for the first 100 genes in my adata, using 100 permutations - but this should not really lead to low p-values, as I think the reported p-values are estimated based on the null distribution shape - not 100% sure though).
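Per-bin selection could look roughly like this (a sketch; `scores` and `mean_expr` are assumed per-gene vectors computed beforehand, and the bin count and top-k are illustrative):

```python
import pandas as pd

# scores, mean_expr: per-gene arrays, assumed computed beforehand
df = pd.DataFrame({"morans_i": scores, "mean_expr": mean_expr})
# Bin genes into expression-matched groups, then rank within each bin,
# so sparse and dense genes are not compared directly against each other.
df["bin"] = pd.qcut(df["mean_expr"], q=10, duplicates="drop")
selected = df.groupby("bin", observed=True)["morans_i"].nlargest(50)
```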
> My current thinking is that Geary's C is more sensitive to sparse features, and may be more in need of significance testing. I think this is not as visible for Visium data, since features there are less sparse.
Really interesting comparison @ivirshup. I think it makes sense that it is more sensitive to sparsity, because the score is not computed "against a mean" but against the neighborhood graph. In some sense, Moran's I is smoother.
> And this was the problem with the p-values in squidpy's Moran's I (plot for the first 100 genes in my adata, using 100 permutations - but this should not really lead to low p-values, as I think the reported p-values are estimated based on the null distribution shape - not 100% sure though).
Indeed, they are computed against the null distribution, which is computed from the permutations.
As a personal opinion, I don't think reporting p-values for these types of statistics is very useful (as in, I personally wouldn't draw conclusions from significance, but rather from effect size, and this holds true for a t-test as well imho...). I should also say that for squidpy we might want to compute p-values for completeness... but again, I would refrain from looking at them. In that case I also think it's fairer, since the graph comes from another source and is not computed from gexp similarity.
Maybe row-scaling of connectivities should be an option, to ensure that Moran's I stays within [-1, 1]; otherwise it may reach very large values on some data.
```python
import scanpy as sc
import pandas as pd

# Moran's I on the unscaled connectivities
morans_i = sc.metrics.morans_i(adata_pb_c)
morans_i = pd.Series(morans_i, index=adata_pb_c.var_names)
print(morans_i.describe())
```
```
count     1.412000e+04
mean     3.701341e+277
std                inf
min     -5.246069e+05
25%      1.087433e-10
50%      4.935376e-01
75%      1.005601e+00
max     4.440351e+281
dtype: float64
```
```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import minmax_scale

# Row scaling so that I is between -1 and 1
adata_pb_c_temp = adata_pb_c.copy()
adata_pb_c_temp.obsp['connectivities'] = csr_matrix(minmax_scale(
    adata_pb_c_temp.obsp['connectivities'].todense(), axis=1))
morans_i = sc.metrics.morans_i(adata_pb_c_temp)
morans_i = pd.Series(morans_i, index=adata_pb_c.var_names)
del adata_pb_c_temp
print(morans_i.describe())
```
```
count     1.412000e+04
mean      1.737230e-01
std       2.733991e-01
min       0.000000e+00
25%      7.163952e-322
50%      1.462434e-321
75%       4.578002e-01
max       9.320416e-01
dtype: float64
```
Also, is there some randomness? I get different results on different runs.
Hi @Hrovatin, there shouldn't be randomness like you are describing, no. Can you try to post a reprex? Re: normalization, I could add it to both Moran's and Geary's - what do you think @ivirshup?
I think it would be generally useful to have some methods for re-weighting graphs (some discussion in #1561), but I'm not sure if keyword arguments per function are the right way to do this. Maybe something in a `graph` module, or in `preprocessing`?
My understanding (source) was that the way to scale weights for Moran's I is to make sure the edge weights are between 0 and 1. Is this right? This should already be the case for UMAP-based connectivities. Is this the graph you were using @Hrovatin? I would definitely like to see the data you used here.
Btw, you could use `maxabs_scale` on a sparse matrix to prevent densification.
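A minimal sketch of that suggestion, using the reduced PBMC dataset as a stand-in for the data above (for non-negative connectivities, dividing each row by its max matches the 0-1 min-max scaling used earlier):

```python
import scanpy as sc
from sklearn.preprocessing import maxabs_scale

adata = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata)
# maxabs_scale accepts sparse input and keeps it sparse (no .todense()),
# scaling each row by its maximum absolute value.
adata.obsp["connectivities"] = maxabs_scale(adata.obsp["connectivities"], axis=1)
morans_i = sc.metrics.morans_i(adata)
```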
There is a sentence at the bottom of https://www.uni-kassel.de/fb07/fileadmin/datas/fb07/5-Institute/IVWL/Kosfeld/lehre/spatial/SpatialEconometrics2.pdf slide 8 about this, but it is not very clear. In my case, on this specific data, scaling each row from 0 to 1 led to I in the expected range. But then sometimes I get values in the expected range also without scaling. Maybe the extreme values I was getting were due to the variability problem.
Regarding how to reproduce it: I currently have an odd dataset, and if I run my Jupyter cell multiple times I sometimes get different results. It is odd. Sometimes the I values are within the range [-1, 1] and sometimes they explode.
Sometimes the differences are very extreme:
Could you share the dataset?
Are you on our lab's Mattermost? I cannot find you there.
Is there an email where I can send it? I cannot post it here @ivirshup
Ha, no I'm not on the mattermost. ivirshup@gmail.com
Though if you could cut down the dataset to something small and more shareable that would be helpful as well.
I can replicate. This is so weird. Sometimes it returns reasonable results.
Progress!
Minimal reproducer:

```python
import numpy as np
import scanpy as sc

pbmc = sc.datasets.pbmc68k_reduced()
pbmc = pbmc[:411].copy()  # the bug only appears for some dataset sizes
sc.pp.neighbors(pbmc)
res = [sc.metrics.morans_i(pbmc.obsp["connectivities"], pbmc.X.T) for i in range(3)]
np.equal(res[0], res[1])  # not all True: repeated runs disagree
```
I can trigger the bug by changing the dataset size.
I think this may be a memory alignment issue.
If `X` is a sparse matrix, I don't get any inconsistency in the results (which would make sense for an alignment issue).
I think you have some variables which are the same for all samples. This leads to a division by zero error, which numba is not handling gracefully or mentioning.
Here's how to find those values:

```python
# indices of variables whose value is identical in every observation
np.where((adata.X[[0], :] == adata.X).all(axis=0))
```
I believe if you filter these out, this should work.
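A sketch of that filtering step (assuming a dense `adata.X`, as in the snippet above):

```python
import numpy as np

# boolean mask of variables with the same value in every observation
constant = (adata.X[[0], :] == adata.X).all(axis=0)
adata = adata[:, ~np.asarray(constant).ravel()].copy()
```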
I'm not sure if there is a correct value for Moran's I or Geary's C in this case. Should we error?
Numba bug report: https://github.com/numba/numba/issues/6976
Thank you! This data is from Compass (they term these "consistencies"). They are negative, but as you already know, the same happened when scaling to [0, 1] per feature. Will remove those constant values. I think the result in such a case could just contain NaNs and emit a warning.
p.s.: I really like how you document everything you do so nicely 😃
> I think the result in such a case could just contain NaNs and emit a warning.
This sounds reasonable to me.
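That behavior could look roughly like this (a hypothetical wrapper, not the scanpy implementation; `score` stands for any Moran's I / Geary's C function):

```python
import warnings
import numpy as np

def score_or_nan(score, w, x):
    if np.ptp(x) == 0:  # constant feature: the statistic is undefined
        warnings.warn("constant feature: returning NaN")
        return np.nan
    return score(w, x)
```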
With sparse values, it's consistently giving results, but they're the wrong results. The iteration is being chunked (probably related to the number of available cores), and it looks like within each chunk all values after the constant one are filled with zeros. I should look into whether this is also numba, or a logic bug. If it's numba, it's strange that it's zeros and not `inf` or `nan`. If it's on our end, I'm not sure why the later iterations would be skipped.
> p.s.: I really like how you document everything you do so nicely 😃
Thanks! It's mostly so I can remember my reasoning a month later 😊
@Hrovatin, if you haven't read it yet, I think you would find the Hotspot method from the Yosef lab interesting. It uses something similar to Moran's I for feature selection, and local Moran's I for gene-module detection. They use parametric null models to get significances for their scores, which would be significantly faster than permutation testing. I'm a little unsure how the parametric null models correspond to the non-parametric ones, since the only comparison I've found so far is some Q-Q plots in the supplement.
Could you add a Moran's I calculation to Scanpy? It could be used in scIB, and also to find variable genes across an embedding (it could be an alternative to SEMITONES, which takes a while to compute).