giovp opened 3 years ago
so, after having installed squidpy, you can compute spatial statistics with
import squidpy as sq
sq.gr.moran(adata, n_jobs=4)
the results are saved in
adata.uns["moranI"].head(10)
and they are already sorted by the I statistic.
For a nice blog post on Moran's I, check this out
you can visualize results with the standard sc.pl.spatial function, e.g.
sc.pl.spatial(adata, color=["Nrgn", "Camk2n1", "Mobp", "cluster"])
a very first interesting analysis that I'm genuinely interested in is: how many significant SVGs (so, filtered by p-value) found by Moran's I are also significant marker genes (after sc.tl.rank_genes_groups(adata, groupby="<your cluster>")).
Then you can, for instance, rank clusters based on the number of overlapping genes between the two statistics. I would expect that clusters that are not randomly dispersed across the tissue have higher overlap with SVGs (because Moran's I tends to select non-random spatial effects).
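A minimal sketch of that overlap analysis, assuming clusters are stored in adata.obs["cluster"], that sc.tl.rank_genes_groups has already been run, and that the adjusted p-value column names below match your squidpy/scanpy versions (they are assumptions, not guaranteed):

```python
import scanpy as sc

pval_cutoff = 0.05  # illustrative threshold

# significant spatially variable genes from the Moran's I table
moran = adata.uns["moranI"]
svg = set(moran.index[moran["pval_sim_fdr_bh"] < pval_cutoff])  # column name assumed

# per cluster, count how many significant marker genes are also SVGs
overlap = {}
for cluster in adata.obs["cluster"].cat.categories:
    markers = sc.get.rank_genes_groups_df(adata, group=cluster)
    sig_markers = set(markers.loc[markers["pvals_adj"] < pval_cutoff, "names"])
    overlap[cluster] = len(svg & sig_markers)

# rank clusters by the number of overlapping genes
print(sorted(overlap.items(), key=lambda kv: kv[1], reverse=True))
```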
Let me know!
Hi Giovanni,
after installing squidpy, the number of clusters from last week has changed. Is that normal? I would have expected the results to stay the same. Do you know how I can make sure to get the same results?
Looking at the blog post, I should have gotten five values: the Moran's I index, expected index, variance, z-score, and p-value. But running adata.uns["moranI"].head(10), I get the following:
Can you explain the column names, please?
I may have misunderstood some terminology. Could you explain what is meant by a highly expressed gene, a marker gene, and a spatially variable gene?
Thank you for your help!
Best, Phuong
Hi,
> after installing squidpy, the number of clusters from last week has changed. Is that normal? I would have expected the results to stay the same. Do you know how I can make sure to get the same results?
mmh, if random seeds were not set in the respective functions, this is not guaranteed. Although results usually stay the same, it can happen.
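As a sketch, most scanpy steps that involve randomness expose a random_state argument you can pin (assuming a standard PCA / neighbors / Leiden pipeline):

```python
import scanpy as sc

# fix seeds at every stochastic step so clustering is reproducible
sc.pp.pca(adata, random_state=0)
sc.pp.neighbors(adata, random_state=0)
sc.tl.leiden(adata, random_state=0)
```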
> Looking at the blog post, I should have gotten five values: the Moran's I index, expected index, variance, z-score, and p-value. But running adata.uns["moranI"].head(10), I get the following:
good point, I should make it return all of them, but I also think the only really important ones are the I statistic and the p-value (both raw and adjusted, e.g. with BH FDR).
> Looking at the blog post, I should have gotten five values: the Moran's I index, expected index, variance, z-score, and p-value. But running adata.uns["moranI"].head(10), I get the following:
the columns are: the I statistic, the raw p-value, the variance, and the adjusted p-value. For I and the variance, a simulation-based approach (permutation-based, with 1k permutations) is performed.
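For reference, a minimal sketch of what a BH FDR adjustment of the raw p-values could look like (the column names are assumptions, not the exact squidpy output):

```python
from statsmodels.stats.multitest import multipletests

moran = adata.uns["moranI"]
# adjust the raw (simulation-based) p-values with Benjamini-Hochberg FDR
_, pvals_adj, _, _ = multipletests(moran["pval_sim"], method="fdr_bh")
moran["pval_sim_fdr_bh"] = pvals_adj
```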
> I may have misunderstood some terminology. Could you explain what is meant by a highly expressed gene, a marker gene, and a spatially variable gene?
a marker gene is one that is differentially expressed in a cluster vs. the rest, i.e. what sc.tl.rank_genes_groups() tests for; a spatially variable gene is one whose expression shows spatial structure, i.e. what sq.graph.moran() tests for; a highly expressed gene is simply one with high overall mean expression, regardless of clusters or space.
hth, let me know if you need more info
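As a sketch, the two calls side by side (the cluster key and the spatial-neighbors step are assumptions; the moran call follows the sq.gr.moran spelling used earlier in the thread):

```python
import scanpy as sc
import squidpy as sq

# marker genes: differential expression of each cluster vs. the rest
sc.tl.rank_genes_groups(adata, groupby="cluster")

# spatially variable genes: Moran's I on a spatial neighbors graph
sq.gr.spatial_neighbors(adata)
sq.gr.moran(adata)
```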
conversion between AnnData and SingleCellExperiment: https://theislab.github.io/zellkonverter/
you can then use SPARK: xzhoulab/SPARK
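A minimal sketch of the Python side of that conversion (the file name is illustrative; the R call in the comment is based on the linked zellkonverter docs):

```python
# write the AnnData to .h5ad so it can be read into R
# in R: sce <- zellkonverter::readH5AD("adata.h5ad")
adata.write_h5ad("adata.h5ad")
```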
Hi Giovanni,
I cannot install SPARK. I tried it in two different ways, but neither seems to work.
install.packages('devtools')
library(devtools)
devtools::install_github('xzhoulab/SPARK')
and
install.packages("Rcpp") install.packages("remotes") remotes::install_github("xzhoulab/SPARK")
Both gave me the same error.
I haven't found anything on the internet. Do you have any advice?
Thank you for your help!
Best, Phuong
mmh, seems like compilation issues. Maybe submit an issue in the repo? They could help, not sure, sorry.
otherwise you could try this, it's quite an interesting idea, and there seem to be nice tutorials. But it's in Python: https://github.com/ManchesterBioinference/GPcounts
Hi Giovanni,
I am trying to use GPcounts with a negative binomial likelihood to determine SVGs. Hence I had to calculate the scale factors, which can be determined by performing negative binomial regression in R. But there is one error I cannot fix.
Do you have any advice?
I have uploaded the code in my branch. Thank you for your help!
Best, Phuong
> Hence I had to calculate the scale factors, which can be determined by performing negative binomial regression in R. But there is one error I cannot fix.
I think you are on the right track, and nice that you managed to try out GPcounts!
for size factors, you can just use the total number of counts per cell. You should have that by simply doing scanpy.pp.calculate_qc_metrics(), and it should be saved under adata.obs["total_counts"].
Could you just use that? I see the point of estimating with the NB GLM, but that one should be a good approximation. Alternatively, I'd have to get a glimpse of your counts object in R before attempting an explanation. But nice try!
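A minimal sketch of what that could look like (the scale variable name is illustrative):

```python
import scanpy as sc

# compute per-cell QC metrics, including total counts
sc.pp.calculate_qc_metrics(adata, inplace=True)

# use total counts per cell as the GPcounts scale factors
scale = adata.obs["total_counts"].values
```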
Thanks for the hint. I will try it.
This is a part of my counts object:
The last column, which you can't see here, is the total number of counts per cell.
I also uploaded a csv file with the entire dataframe in my branch. The first column is actually the row names.
Best, Phuong
Hi Giovanni,
Unfortunately, I still cannot compute the SVGs. I have run into a new problem which I cannot fix.
Do you have any idea?
Thank you for your help!
Best, Phuong
eh, no, sorry, this is also something that is specific to that library. Probably opening an issue there could be of help, sorry.
Hi Giovanni,
the people from the GPcounts package helped me fix the code. They said that the scales variable should have the same dimension as the gene expression counts matrix, which was not the case for their data. So I also tried it with 100 columns. Then it took me 2:05 hours to compute the first genes. I'm not sure if it will work because there was already an error. What do you think?
This is only the time for the first 20 genes, but I have 3138 in total.
Do you know if it's possible to calculate them faster (does GPU work?)?
Also, they just published some code to compute the scale factors with Python. I would also like to try this, but that will also cost a lot of computational time.
Thank you for your help!
Best, Phuong
I had a look at the issue, and in their first reply they show a way to speed it up:
from GPcounts.GPcounts_Module import Fit_GPcounts  # import path assumed from the GPcounts tutorials

sparse = True
nb_scaled = True  # set the nb_scaled argument to True to pass the scale factors
gene_name = Y_run.index
likelihood = 'Negative_binomial'
gp_counts = Fit_GPcounts(X, Y_run.loc[gene_name], scale=scale, sparse=sparse, nb_scaled=nb_scaled, safe_mode=False)
gp_counts.ConditionalVariance_inducing_points(M=20)
the key part is when you instantiate the gp_counts object
gp_counts = Fit_GPcounts(X, Y_run.loc[gene_name], scale=scale, sparse=sparse, nb_scaled=nb_scaled, safe_mode=False)
basically setting sparse=True
and then
gp_counts.ConditionalVariance_inducing_points(M=20)
setting a fixed number of inducing points. Can you try with that? It should be faster. Also, re genes, you probably want to run it only on highly variable genes
genes = adata[:, adata.var.highly_variable].var_names.values
so you have fewer genes to evaluate.
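In case the highly_variable column is not there yet, a minimal sketch of computing it first (n_top_genes is illustrative; assumes log-normalized data):

```python
import scanpy as sc

# flag highly variable genes, then restrict to them
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
genes = adata[:, adata.var.highly_variable].var_names.values
```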
looking forward to seeing the results!
Hi Giovanni,
I just uploaded the GPcounts notebook. It's very similar to the original. Because it is computationally expensive, I saved the scale factors in a csv file. But the file is too big to upload (even the zipped one). Do you know how to upload it anyway?
I have normalized the data twice (sc.pp.normalize_total and the scale factors). Do you think that is a problem? (The scale factors were determined after the first normalization.)
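One common way to sidestep that, sketched below, is to stash the raw counts in a layer before normalizing and derive the scale factors from the raw counts; this is a general idiom, not something from the thread:

```python
import scanpy as sc

# keep raw counts around before any normalization
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
```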
Thank you for your help!
Best, Phuong
sc.datasets.visium_sge()
sc.tl.rank_genes_groups()
Then we'll discuss spatial statistics tools next week. In R you can try this: https://github.com/xzhoulab/SPARK
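As a sketch, the two steps above could look like this (the sample_id and cluster key are illustrative):

```python
import scanpy as sc

# load a 10x Visium dataset from the scanpy datasets
adata = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")

# ... preprocessing and clustering go here ...

# marker genes per cluster
sc.tl.rank_genes_groups(adata, groupby="cluster")
```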
@ngocphuongtran