jianhuupenn / SpaGCN

SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network
MIT License
183 stars 55 forks source link

Minimal workflow to compute spatially variable genes #2

Closed davidecrs closed 2 years ago

davidecrs commented 3 years ago

Hi,

I cloned your repo and I'm trying your tutorial to understand a minimal worflow to compute spatially variable genes. But there are some steps that I cannot understand.

This is your code from the jupyter tutorial, in which an AnnData can be created from 10x data

from scanpy import read_10x_h5

adata = read_10x_h5("../tutorial/data/151673/expression_matrix.h5")
spatial=pd.read_csv("../tutorial/data/151673/positions.txt",sep=",",header=None,na_filter=False,index_col=0) 
adata.obs["x1"]=spatial[1]
adata.obs["x2"]=spatial[2]
adata.obs["x3"]=spatial[3]
adata.obs["x4"]=spatial[4]
adata.obs["x5"]=spatial[5]

#Select captured samples
adata=adata[adata.obs["x1"]==1]
adata.var_names=[i.upper() for i in list(adata.var_names)]
adata.var["genename"]=adata.var.index.astype("str")

adata.write_h5ad("../tutorial/data/151673/sample_data.h5ad")

In the last line of code you save this AnnData as h5ad file.

In the section Identify SVGs the tuorial shows the loading of such object

#Read in raw data
raw=sc.read("../tutorial/data/151673/sample_data.h5ad")
raw.var_names_make_unique()
raw.obs["pred"]=adata.obs["pred"].astype('category')
raw.obs["x_array"]=raw.obs["x2"]
raw.obs["y_array"]=raw.obs["x3"]
raw.obs["x_pixel"]=raw.obs["x4"]
raw.obs["y_pixel"]=raw.obs["x5"]
#Convert sparse matrix to non-sparse
raw.X=(raw.X.A if issparse(raw.X) else raw.X)
raw.raw=raw
sc.pp.log1p(raw)

However, all this block of code is not executed successfully because I get the following error

      2 raw=sc.read("../tutorial/data/151673/sample_data.h5ad")
      3 raw.var_names_make_unique()
----> 4 raw.obs["pred"]=adata.obs["pred"].astype('category')
      5 raw.obs["x_array"]=raw.obs["x2"]
      6 raw.obs["y_array"]=raw.obs["x3"]

NameError: name 'adata' is not defined

If I change that adata variable with raw I get the following error

KeyError: 'pred'

So it seems that other intermediate steps are needed probably to insert something in the key pred. Logically it seems that the two steps inserted above can be consequential, as the object is first saved and then reloaded. But I'm not understanding what I'm doing wrong.

So I was wondering if there is any intermediate step is missing in the tutorial where you had to update the h5ad file, or if there is any step that I missed.

Thanks in advance.

Best regards, Davide

jianhuupenn commented 3 years ago

Hi Davide,

The SVG detection is based on the spatial domains, so you need to run spatial domain detection before detecting SVGs. As you can see in section 5.3 Run SpaGCN

#Run
clf.train(adata,adj,init_spa=True,init="louvain",res=res, tol=5e-3, lr=0.05, max_epochs=200)
y_pred, prob=clf.predict()
adata.obs["pred"]= y_pred

That is where we get the "pred".

Hope this solves your problem.

davidecrs commented 2 years ago

Hi @jianhuupenn ,

thanks for your answer, indeed I tried what you suggested, but unfortunately this hasn’t solved my issue because in the end I obtain 0 results in the final table. To give you an idea, I traced the code with a bottom-up approach building the following workflow and obtaining the 0 results. What do you suggest to do? Is there any error I’m not able to see?

Thanks, Davide

from scanpy import read_10x_h5
import pandas as pd
import cv2
import SpaGCN as spg
import numpy as np
import random
import torch
import math
from scipy.sparse import issparse
import scanpy as sc

adata = read_10x_h5("../tutorial/data/151673/expression_matrix.h5")
spatial=pd.read_csv("../tutorial/data/151673/positions.txt",sep=",",header=None,na_filter=False,index_col=0) 
adata.obs["x1"]=spatial[1]
adata.obs["x2"]=spatial[2]
adata.obs["x3"]=spatial[3]
adata.obs["x4"]=spatial[4]
adata.obs["x5"]=spatial[5]
#Select captured samples
adata=adata[adata.obs["x1"]==1]
adata.var_names=[i.upper() for i in list(adata.var_names)]
adata.var["genename"]=adata.var.index.astype("str")

#Set coordinates
adata.obs["x_array"]=adata.obs["x2"]
adata.obs["y_array"]=adata.obs["x3"]
adata.obs["x_pixel"]=adata.obs["x4"]
adata.obs["y_pixel"]=adata.obs["x5"]
x_array=adata.obs["x_array"].tolist()
y_array=adata.obs["y_array"].tolist()
x_pixel=adata.obs["x_pixel"].tolist()
y_pixel=adata.obs["y_pixel"].tolist()

#Calculate adjacent matrix
s=1
b=49
# adj=spg.calculate_adj_matrix(x=x_pixel,y=y_pixel, x_pixel=x_pixel, y_pixel=y_pixel, image=img, beta=b, alpha=s, histology=True)
#If histlogy image is not available, SoaGCN can calculate the adjacent matrix using the fnction below
adj=spg.calculate_adj_matrix(x=x_pixel,y=y_pixel, histology=False)
# np.savetxt('./data/adj.csv', adj, delimiter=',')

p=0.5 
#Find the l value given p
l=spg.search_l(p, adj, start=0.01, end=1000, tol=0.01, max_run=100)

#For this toy data, we set the number of clusters=7 since this tissue has 7 layers
n_clusters=7
#Set seed
r_seed=t_seed=n_seed=100
#Search for suitable resolution
res=spg.search_res(adata, adj, l, n_clusters, start=0.7, step=0.1, tol=5e-3, lr=0.05, max_epochs=20, r_seed=r_seed, t_seed=t_seed, n_seed=n_seed)

#Run
clf=spg.SpaGCN()
clf.set_l(l)
clf.train(adata,adj,init_spa=True,init="louvain",res=res, tol=5e-3, lr=0.05, max_epochs=200)

y_pred, prob=clf.predict()

adata.obs["pred"]= y_pred

raw = adata

raw.var_names_make_unique()
raw.obs["pred"]=adata.obs["pred"].astype('category')
raw.obs["x_array"]=raw.obs["x2"]
raw.obs["y_array"]=raw.obs["x3"]
raw.obs["x_pixel"]=raw.obs["x4"]
raw.obs["y_pixel"]=raw.obs["x5"]
#Convert sparse matrix to non-sparse
raw.X=(raw.X.A if issparse(raw.X) else raw.X)
raw.raw=raw
sc.pp.log1p(raw)

target=0

#Set filtering criterials
min_in_group_fraction=0.8
min_in_out_group_ratio=1
min_fold_change=1.5

#Search radius such that each spot in the target domain has approximately 10 neighbors on average
adj_2d=spg.calculate_adj_matrix(x=x_array, y=y_array, histology=False)
start, end= np.quantile(adj_2d[adj_2d!=0],q=0.001), np.quantile(adj_2d[adj_2d!=0],q=0.1)
r=spg.search_radius(target_cluster=target, cell_id=adata.obs.index.tolist(), x=x_array, y=y_array, pred=adata.obs["pred"].tolist(), start=start, end=end, num_min=10, num_max=14,  max_run=100)

#Detect neighboring domains
nbr_domians=spg.find_neighbor_clusters(target_cluster=target,
                                   cell_id=raw.obs.index.tolist(), 
                                   x=raw.obs["x_array"].tolist(), 
                                   y=raw.obs["y_array"].tolist(), 
                                   pred=raw.obs["pred"].tolist(),
                                   radius=r,
                                   ratio=1/2)

nbr_domians=nbr_domians[0:3]
de_genes_info=spg.rank_genes_groups(input_adata=raw,
                                target_cluster=target,
                                nbr_list=nbr_domians, 
                                label_col="pred", 
                                adj_nbr=True, 
                                log=True)
#Filter genes
de_genes_info=de_genes_info[(de_genes_info["pvals_adj"]<0.05)]
filtered_info=de_genes_info
filtered_info=filtered_info[(filtered_info["pvals_adj"]<0.05) &
                            (filtered_info["in_out_group_ratio"]>min_in_out_group_ratio) &
                            (filtered_info["in_group_fraction"]>min_in_group_fraction) &
                            (filtered_info["fold_change"]>min_fold_change)]
filtered_info=filtered_info.sort_values(by="in_group_fraction", ascending=False)
filtered_info["target_dmain"]=target
filtered_info["neighbors"]=str(nbr_domians)
print("SVGs for domain ", str(target),":", filtered_info["genes"].tolist())

filtered_info
jianhuupenn commented 2 years ago

Hi Davide, I think the reason is the filtering criteria for SVGs are too restricted so that no gene is left in the final table. This happens when some spatial domains do not have very good gene enrichment. I would recommend you to lower the filtering criteria, for example: min_in_group_fraction=0.8 (may be changed to 0.5 or lower) min_in_out_group_ratio=1 (may be changed to 0.5 or lower) min_fold_change=1.5 (may be changed to 1.2 or lower)

This will results in more detected SVGs but with weaker spatial expression patterns.

Best, Jian

davidecrs commented 2 years ago

Hi @jianhuupenn ,

thanks for your answer, but all filtering criteria displayed in https://github.com/jianhuupenn/SpaGCN/issues/2#issuecomment-889241273 are the same of the original tutorial.ipynb of the repo. I'm still not able to reproduce the original results attached in your tutorial.ipynb. Anyway, I tried to apply the changes that you suggested here https://github.com/jianhuupenn/SpaGCN/issues/2#issuecomment-889288382 but still getting no results.

I've started again to extract a minimal workflow to compute spatially variable genes without saving and loading any intermediate data. Just with a linear series of steps, but I'm still finding few inconsistencies that are not allowing me to proceed with the analysis steps.

This is the new workflow (maybe is the same of the previous one) with the output attached in a quote style

import os,csv,re
import pandas as pd
import numpy as np
import scanpy as sc
import math
import SpaGCN as spg
from scipy.sparse import issparse
import random, torch
import warnings
warnings.filterwarnings("ignore")
import matplotlib.colors as clr
import matplotlib.pyplot as plt
import SpaGCN as spg
import cv2
from scanpy import read_10x_h5

adata = read_10x_h5("../tutorial/data/151673/expression_matrix.h5")
spatial=pd.read_csv("../tutorial/data/151673/positions.txt",sep=",",header=None,na_filter=False,index_col=0) 
adata.obs["x1"]=spatial[1]
adata.obs["x2"]=spatial[2]
adata.obs["x3"]=spatial[3]
adata.obs["x4"]=spatial[4]
adata.obs["x5"]=spatial[5]
#Select captured samples
adata=adata[adata.obs["x1"]==1]
adata.var_names=[i.upper() for i in list(adata.var_names)]
adata.var["genename"]=adata.var.index.astype("str")
img=cv2.imread("../tutorial/data/151673/histology.tif")

Output:

Variable names are not unique. To make them unique, call .var_names_make_unique. Variable names are not unique. To make them unique, call .var_names_make_unique. Variable names are not unique. To make them unique, call .var_names_make_unique. Variable names are not unique. To make them unique, call .var_names_make_unique.

#Set coordinates
adata.obs["x_array"]=adata.obs["x2"]
adata.obs["y_array"]=adata.obs["x3"]
adata.obs["x_pixel"]=adata.obs["x4"]
adata.obs["y_pixel"]=adata.obs["x5"]
x_array=adata.obs["x_array"].tolist()
y_array=adata.obs["y_array"].tolist()
x_pixel=adata.obs["x_pixel"].tolist()
y_pixel=adata.obs["y_pixel"].tolist()

#Calculate adjacent matrix
s=1
b=49
adj=spg.calculate_adj_matrix(x=x_pixel,y=y_pixel, x_pixel=x_pixel, y_pixel=y_pixel, image=img, beta=b, alpha=s, histology=True)
#If histlogy image is not available, SoaGCN can calculate the adjacent matrix using the fnction below
# adj=spg.calculate_adj_matrix(x=x_pixel,y=y_pixel, histology=False)

Output:

Calculateing adj matrix using histology image... Var of c0,c1,c2 = 33.30687202862215 174.55510595352243 46.84205750749746 Var of x,y,z = 5606737.526317932 4468793.817921193 5606737.526317932

adata.var_names_make_unique()
spg.prefilter_genes(adata,min_cells=3) # avoiding all genes are zeros
spg.prefilter_specialgenes(adata)
#Normalize and take log for UMI
sc.pp.normalize_per_cell(adata)
sc.pp.log1p(adata)

p=0.5 
#Find the l value given p
l=spg.search_l(p, adj, start=0.01, end=1000, tol=0.01, max_run=100)

Output:

Run 1: l [0.01, 1000], p [0.0, 153.882049263866] Run 2: l [0.01, 500.005], p [0.0, 28.01544761657715] Run 3: l [0.01, 250.0075], p [0.0, 4.240330219268799] Run 4: l [0.01, 125.00874999999999], p [0.0, 0.5157277584075928] Run 5: l [62.509375, 125.00874999999999], p [0.028496861457824707, 0.5157277584075928] Run 6: l [93.7590625, 125.00874999999999], p [0.18753135204315186, 0.5157277584075928] Run 7: l [109.38390625, 125.00874999999999], p [0.32801353931427, 0.5157277584075928] Run 8: l [117.196328125, 125.00874999999999], p [0.41564691066741943, 0.5157277584075928] Run 9: l [121.1025390625, 125.00874999999999], p [0.4640926122665405, 0.5157277584075928] Run 10: l [123.05564453125, 125.00874999999999], p [0.4895068407058716, 0.5157277584075928] recommended l = 124.032197265625

#If the number of clusters known, we can use the spg.search_res() fnction to search for suitable resolution(optional)
#For this toy data, we set the number of clusters=7 since this tissue has 7 layers
n_clusters=7
#Set seed
r_seed=t_seed=n_seed=100
#Seaech for suitable resolution
res=spg.search_res(adata, adj, l, n_clusters, start=0.7, step=0.1, tol=5e-3, lr=0.05, max_epochs=20, r_seed=r_seed, t_seed=t_seed, n_seed=n_seed)

Output:

Start at res = 0.7 step = 0.1 Initializing cluster centers with louvain, resolution = 0.7 Epoch 0 Epoch 10 Res = 0.7 Num of clusters = 7 recommended res = 0.7

clf=spg.SpaGCN()
clf.set_l(l)
#Set seed
random.seed(r_seed)
torch.manual_seed(t_seed)
np.random.seed(n_seed)
#Run
clf.train(adata,adj,init_spa=True,init="louvain",res=res, tol=5e-3, lr=0.05, max_epochs=200)
y_pred, prob=clf.predict()
adata.obs["pred"]= y_pred
adata.obs["pred"]=adata.obs["pred"].astype('category')
#Do cluster refinement(optional)
#shape="hexagon" for Visium data, "square" for ST data.
adj_2d=spg.calculate_adj_matrix(x=x_array,y=y_array, histology=False)
refined_pred=spg.refine(sample_id=adata.obs.index.tolist(), pred=adata.obs["pred"].tolist(), dis=adj_2d, shape="hexagon")
adata.obs["refined_pred"]=refined_pred
adata.obs["refined_pred"]=adata.obs["refined_pred"].astype('category')

Output:

Initializing cluster centers with louvain, resolution = 0.7 Epoch 0 Epoch 10 Epoch 20 Epoch 30 Epoch 40 Epoch 50 Epoch 60 Epoch 70 Epoch 80 Epoch 90 Epoch 100 Epoch 110 Epoch 120 Epoch 130 Epoch 140 Epoch 150 Epoch 160 Epoch 170 Epoch 180 Epoch 190 Calculateing adj matrix using xy only...

In the next cell there are few inconsistencies:

  1. I had to add raw = adata because in the original tutorial.ipynb you load the same data raw=sc.read("../tutorial/data/151673/sample_data.h5ad") that is already loaded.
  2. with this line raw.obs["pred"]=adata.obs["pred"].astype('category') you are referring to adata from raw variable which is basically the same object (if I understood correctly), but I do not understand very well why keep both the variables.
  3. The WARNING output is due to sc.pp.log1p(raw) which was already run in the third chunk, so I think I have to remove it. Am I right ?
    
    #Read in raw data
    raw = adata

raw.var_names_make_unique() raw.obs["pred"]=adata.obs["pred"].astype('category') raw.obs["x_array"]=raw.obs["x2"] raw.obs["y_array"]=raw.obs["x3"] raw.obs["x_pixel"]=raw.obs["x4"] raw.obs["y_pixel"]=raw.obs["x5"]

Convert sparse matrix to non-sparse

raw.X=(raw.X.A if issparse(raw.X) else raw.X) raw.raw=raw sc.pp.log1p(raw)

Output:
> WARNING: adata.X seems to be already log-transformed.

Use domain 0 as an example

target=0

Set filtering criterials

min_in_group_fraction=0.8 min_in_out_group_ratio=1 min_fold_change=1.5

Search radius such that each spot in the target domain has approximately 10 neighbors on average

adj_2d=spg.calculate_adj_matrix(x=x_array, y=y_array, histology=False) start, end= np.quantile(adj_2d[adj_2d!=0],q=0.001), np.quantile(adj_2d[adj_2d!=0],q=0.1) r=spg.search_radius(target_cluster=target, cell_id=adata.obs.index.tolist(), x=x_array, y=y_array, pred=adata.obs["pred"].tolist(), start=start, end=end, num_min=10, num_max=14, max_run=100)

Detect neighboring domains

nbr_domians=spg.find_neighbor_clusters(target_cluster=target, cell_id=raw.obs.index.tolist(), x=raw.obs["x_array"].tolist(), y=raw.obs["y_array"].tolist(), pred=raw.obs["pred"].tolist(), radius=r, ratio=1/2)

nbr_domians=nbr_domians[0:3] de_genes_info=spg.rank_genes_groups(input_adata=raw, target_cluster=target, nbr_list=nbr_domians, label_col="pred", adj_nbr=True, log=True)

Filter genes

de_genes_info=de_genes_info[(de_genes_info["pvals_adj"]<0.05)] filtered_info=de_genes_info filtered_info=filtered_info[(filtered_info["pvals_adj"]<0.05) & (filtered_info["in_out_group_ratio"]>min_in_out_group_ratio) & (filtered_info["in_group_fraction"]>min_in_group_fraction) & (filtered_info["fold_change"]>min_fold_change)] filtered_info=filtered_info.sort_values(by="in_group_fraction", ascending=False) filtered_info["target_dmain"]=target filtered_info["neighbors"]=str(nbr_domians) print("SVGs for domain ", str(target),":", filtered_info["genes"].tolist())


Output:
> Calculateing adj matrix using xy only...
> Calculateing adj matrix using xy only...
> Calculateing adj matrix using xy only...
> Run 1: radius [1.4142135381698608, 16.970561981201172], num_nbr [1.0, 411.86907449209934]
> Calculateing adj matrix using xy only...
> Run 2: radius [1.4142135381698608, 9.192387759685516], num_nbr [1.0, 132.54627539503386]
> Calculateing adj matrix using xy only...
> Run 3: radius [1.4142135381698608, 5.303300648927689], num_nbr [1.0, 44.18510158013544]
> Calculateing adj matrix using xy only...
> Run 4: radius [1.4142135381698608, 3.3587570935487747], num_nbr [1.0, 20.74943566591422]
> Calculateing adj matrix using xy only...
> Run 5: radius [2.386485315859318, 3.3587570935487747], num_nbr [8.936794582392777, 20.74943566591422]
> Calculateing adj matrix using xy only...
> recommended radius =  2.8726212047040462 num_nbr=12.880361173814899
> ... storing 'feature_types' as categorical
> ... storing 'genome' as categorical

> radius= 2.8726212047040462 average number of neighbors for each spot is 12.880361173814899
>  Cluster 0 has neighbors:
> Dmain  3 :  931
> Dmain  4 :  907

> ... storing 'genename' as categorical

> SVGs for domain  0 : []

I'm having some difficulties to extract a workflow without save and load intermediate data to perform the spatial variable genes analysis. Moreover, with the original `tutorial.ipynb` you provide I cannot reproduce your results, even if I change nothing.
Do you have any idea why I cannot obtain your results, both with the original jupyter notebook and my extracted workflow?

Thanks in advance.

Best regards,
Davide
jianhuupenn commented 2 years ago

Hi Davide,

  1. After spatial domain detection, I reload the raw count data because the "data" object is already modified in:

    spg.prefilter_genes(adata,min_cells=3) # avoiding all genes are zeros
    spg.prefilter_specialgenes(adata)
    #Normalize and take log for UMI
    sc.pp.normalize_per_cell(adata)
    sc.pp.log1p(adata)

    The SVG detection and filtering are based on log(UMI+1), therefore, I reload the data as "raw" for downstream analysis.

  2. To check why you cannot reproduce the results, I think it would be better to first check the spatial domains. Could you plot the spatial domains from SpaGCN? It should be similar to the figures I showed in the tutorial even if you use a different seed and a different environment. After that, we can furtherly check the SVGs.

Best, Jian

davidecrs commented 2 years ago

Hi @jianhuupenn ,

thanks for your answer. These are the plots of spatial domains Screenshot from 2021-09-23 09-32-14

jianhuupenn commented 2 years ago

Hi Davide, The spatial domain looks fine to me. In SVG detection for domain 0(red), the neighbouring domains are also correctly detected (domains 3 and 4 in green and blue). To furtherly check why no SVG was detected for domain 0 is, here is what I suggest to do:

  1. Detect SVGs for other domains, to see if some SVGs can be detected. I think if you use log(UMI+1), you should results in a similar gene list as I got (attached below).

  2. For domain 0, you can check why the genes with significant p values(<0.05) are filtered out by looking at:

    de_genes_info

    This table contains all the DE genes for domain 0 and statistics such as "in_out_group_ratio"and "in_group_fraction".

  3. I also attached my spatial domain and SVGs for this data FYI:

    Screen Shot 2021-09-23 at 2 33 38 PM

0: ['CAMK2N1', 'ENC1', 'GPM6A', 'ARPP19', 'HPCAL1'], 1: ['PCP4'], 2: [], 3: ['NEFL', 'NEFM', 'SNCG'], 4: ['MBP', 'PLP1'], 5: ['MBP', 'PTGDS', 'FTL', 'GFAP', 'FTH1', 'CRYAB', 'PLP1', 'CNP', 'TUBA1A', 'EEF1A1', 'S100B', 'RPS12', 'APLP1', 'RPL30', 'TF', 'RPS8', 'MOBP', 'PTMA', 'SCD', 'SEPT4', 'CLDND1', 'GPM6B', 'MARCKSL1', 'APOD', 'PPP1R14A', 'MAG', 'C4ORF48', 'SPP1', 'PLEKHB1', 'RNASE1', 'QDPR', 'QKI', 'FEZ1', 'GPRC5B', 'PAQR6', 'CSRP1', 'TSC22D4', 'SELENOP', 'CLDN11', 'ERMN', 'GSN', 'ABCA2', 'LINC00844', 'BCAS1', 'SIRT2', 'EDIL3', 'CNTN2', 'MOG', 'OLIG1', 'CARNS1', 'ENPP2', 'SEPT8', 'MAL'], 6: ['RTN1', 'CCK', 'CHN1']

You can find that domain 0 in your results is labelled as domain 1 in my results, and only one gene, "PCP4" is detected as SVG. It is normal that some domain has few or no SVG after the strong filtering criteria. In this case, I would suggest lowering the filtering criteria or using a meta gene to mark that domain.

Thanks, Jian