theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License
412 stars 102 forks source link

warning : gene not found after scv.pl.scatter #235

Closed akramdi closed 3 years ago

akramdi commented 4 years ago

Hello everyone,

I'm just getting started at scVelo and I struggled a lot to put together the input data that combines splicing info from the loom file obtained by velocyto and extra information (umap, clusters, cells) stored in a Seurat object after data analysis. I followed the steps Originally posted by @Mevelo in https://github.com/theislab/scvelo/issues/161#issuecomment-591445362 in order to use scv.utils.merge that does the job of merging the two. However, at one point, when I try to interpret the velocity by plotting some marker genes using scv.pl.scatter(adata, gene_names), I get a warning that some genes are not found. This rises a big red flag for me as these are my favourite genes and they should be expressed (it is the case in the Seurat analysis). This made me doubt the initial steps where I put the input together. Could you please tell me if I'm doing something wrong?

library(Seurat)
library(SeuratDisk)
library(SeuratWrappers)

sobj.processed.sc <- readRDS(RDSfile) 
DefaultAssay(sobj.processed.sc)<-'RNA'
outFILE=sprintf("%s/%s.h5Seurat", outputDIR, sampleName) 
Convert(outFILE, dest = "h5ad", overwrite = T)
SaveH5Seurat(sobj.processed.sc, filename = outFILE, overwrite=T)
#======= Import loom file obtained from velocyto
adata = scv.read(lFile, cache=True)
adata.var_names_make_unique() #Because I get: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
>>> adata
AnnData object with n_obs × n_vars = 12209 × 33538
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'

#======= Read Seurat object converted to .h5ad
sc_data = scv.read(seurat_h5adFile, cache=True)
>>> sc_data
AnnData object with n_obs × n_vars = 8120 × 19695
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.RPL', 'percent.RPS', 'S.Score', 'G2M.Score', 'Phase', 'CC.Difference', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.8', 'seurat_clusters'
    var: 'features'
    obsm: 'X_umap', 'X_umap.graph'

#======== Merge both
sceMerge = scv.utils.merge(adata, sc_data)
>>> sceMerge
AnnData object with n_obs × n_vars = 8120 × 19679
    obs: 'initial_size_spliced', 'initial_size_unspliced', 'initial_size', 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.RPL', 'percent.RPS', 'S.Score', 'G2M.Score', 'Phase', 'CC.Difference', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.8', 'seurat_clusters'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'features'
    obsm: 'X_umap', 'X_umap.graph'
    layers: 'ambiguous', 'matrix', 'spliced', 'unspliced'

Does the structure of sceMerge object look ok?

#== Filter and noramlize /  Estimation of velocities /Get velocity graph
scv.pp.filter_and_normalize(sceMerge, min_shared_counts=20, n_top_genes=2000)

Filtered out 10401 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Logarithmized X.

scv.pp.moments(sceMerge, n_pcs=30, n_neighbors=30)

computing neighbors
    finished (0:00:05) --> added 
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:01) --> added 
    'Ms' and 'Mu', moments of spliced/unspliced abundances (adata.layers)

scv.tl.velocity(sceMerge)
scv.tl.velocity_graph(sceMerge)

#-- Here's here I get the warning: 
>>> MESgenes=['CD99',  'FN1']
>>> scv.pl.scatter(sceMerge, MESgenes, ncols=2)
WARNING: CD99 not found.

Does anything seem off? Because I'm missing some genes, I was wondering if scv.utils.merge function also merges features between the two files ? or is it used to subset cells and import umap embeddings and such?

Thank you in advance for your comments! Best,

saum-kmr commented 3 years ago

I am also wondering if there is any explanation as to why I am missing genes ? Most of the genes of interest in my analysis that I got using Velocyto.R do not come up using scVelo.

WeilerP commented 3 years ago

@akramdi, yes, scv.utils.merge performs an inner merge, i.e. only features contained in both AnnData objects are kept.

>>> import numpy as np
>>> import pandas as pd
>>> from anndata import AnnData
>>> import scvelo as scv
>>>
>>> adata_1 = AnnData(pd.DataFrame(np.eye(2), columns=['gene_1', 'gene_2']))
>>> adata_2 = AnnData(pd.DataFrame(np.eye(2), columns=['gene_1', 'gene_3']))
>>>
>>> adata = scv.utils.merge(adata_1, adata_2)
>>> adata
AnnData object with n_obs × n_vars = 2 × 1
>>> adata.var_names
Index(['gene_1'], dtype='object')

You should check that

  1. the genes are present in both AnnData objects
  2. the genes are not being filtered out by scv.pp.filter_and_normalize. In scvelo>=0.2.2 you can use the keyword argument retain_genes to exclude specific genes from being filtered out.