theislab / scvelo

RNA Velocity generalized through dynamical modeling
https://scvelo.org
BSD 3-Clause "New" or "Revised" License
414 stars 102 forks source link

X will not be logtransformed if loom is pasted to 10X matrix #857

Closed hyjforesight closed 2 years ago

hyjforesight commented 2 years ago

Hello scVelo and @WeilerP, When I load the 10X raw matrix and paste unspliced and spliced information into it, scv.pp.filter_and_normalize() cannot log-transform the X. Please see the below codes. Appreciate it if you could help me with this issue. Thanks! Best, YJ

If paste unspliced and spliced information into 10X matrix, cannot log-transform the X.

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix/', var_names='gene_symbols')    # load 10X matrix
ACTWT.var_names_make_unique()
ACTWT
AnnData object with n_obs × n_vars = 18228 × 32285
    var: 'gene_ids', 'feature_types'

ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Tumor.loom')
test = scv.utils.merge(ACTWT, ldata)    # paste unspliced and spliced information into it
AnnData object with n_obs × n_vars = 18228 × 32285
    obs: 'Clusters', '_X', '_Y', 'initial_size_spliced', 'initial_size_unspliced', 'initial_size'
    var: 'gene_ids', 'feature_types', 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

scv.pp.filter_and_normalize(test, min_shared_counts=20, n_top_genes=2000)    # cannot log-transform the X
Filtered out 22229 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
WARNING: Did not modify X as it looks preprocessed already.    # should be `Logarithmized X`

This BUG is not caused by loom file itself, because if I load loom file, everything works fine.

adata = scv.read(filename='C:/Users/Park_Lab/Documents/Tumor.loom')
adata
AnnData object with n_obs × n_vars = 18228 × 32285
    obs: 'Clusters', '_X', '_Y'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
Filtered out 22229 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.
Logarithmized X.
Error output ```pytb WARNING: Did not modify X as it looks preprocessed already. ```
Versions ```pytb scanpy==1.8.2 anndata==0.7.8 umap==0.5.2 numpy==1.21.5 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.13.1 python-igraph==0.9.9 pynndescent==0.5.5 scvelo==0.2.4 scanpy==1.8.2 anndata==0.7.8 loompy==3.0.6 numpy==1.21.5 scipy==1.7.3 matplotlib==3.5.1 sklearn==1.0.2 pandas==1.3.5 cellrank==1.5.1 scanpy==1.8.2 anndata==0.7.8 numpy==1.21.5 numba==0.55.0 scipy==1.7.3 pandas==1.3.5 pygpcca==1.0.3 scikit-learn==1.0.2 statsmodels==0.13.1 python-igraph==0.9.9 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.5.1 seaborn==0.11.2 ```
WeilerP commented 2 years ago

@hyjforesight, I presume adata.X and adata.layers['spliced'] are not the same array when you load the 10x matrix first and then the LOOM file? scVelo expects the same entry in adata.X and adata.layers['spliced']. If the two are different this flag will be False.

hyjforesight commented 2 years ago

Hello @WeilerP Thanks for the response. The 10X matrix and loom were generated from the same FASTQ files. I did CellRanger to generate 10X matrix and used velocyto.py to generate loom file from the same CellRanger outputs. Their cell numbers and gene numbers are the same (as the code shown).

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/filtered_feature_bc_matrix/', var_names='gene_symbols')    # load 10X matrix
ACTWT.var_names_make_unique()
ACTWT
AnnData object with n_obs × n_vars = 18228 × 32285
    var: 'gene_ids', 'feature_types'

loom = scv.read(filename='C:/Users/Park_Lab/Documents/Tumor.loom')
loom
AnnData object with n_obs × n_vars = 18228 × 32285
    obs: 'Clusters', '_X', '_Y'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

Is it because there're some obs (obs: 'Clusters', '_X', '_Y') variables in loom file? image

Thanks! Best, YJ

WeilerP commented 2 years ago

@hyjforesight, unless I misunderstand, adata.X and adata.layers['spliced'] are not the same since one is the CellRanger output (i.e., unaligned reads) and the other comes from the velocyto pipeline (i.e., a fraction of the former) You can double check this by running

(adata.X != adata.layers['spliced']).nnz == 0

If they are equal this would return True.

hyjforesight commented 2 years ago

Hello @WeilerP Thanks for your explanation. Now I have a better understanding of the loom file. I ran (adata.X != adata.layers['spliced']).nnz == 0 and it was False. In this case, is there any way to make this pasted adata work for the log-transformation? Because we want to do scVelo on a subset of the 10X matrix. I loaded the raw loom file and tried to do a view, expecting to use this view for scVelo, but it failed because some prefixes and suffixes were automatically added onto the index. Loom file cannot be sliced by the index of h5ad. Plese see the codes.

loom = sc.read_loom(filename='C:/Users/Park_Lab/Documents/Tumor.loom')
loom.obs['Clusters']
CellID
Tumor:AAAGAACCAATAGTGAx    13
Tumor:AAACGCTCAGCAGTTTx    10
Tumor:AAACGAACACAAGCAGx     2
Tumor:AAAGGATCATGGCACCx     5

ACTWT = sc.read('C:/Users/Park_Lab/Documents/ACTWT.h5ad')
loom.obs['batch']= ACTWT.obs['batch']
adata=loom[loom.obs['batch'].isin(['ACTWT'])]
adata
View of AnnData object with n_obs × n_vars = 0 × 32285

ACTWT.obs['leiden']
AAACCCAAGAAAGTCT-1-ACTWT    6
AAACCCAAGATACTGA-1-ACTWT    3

How about running the scv.pp.filter_and_normalize() step by step like this? Does it do log-transformation but I didn't see any notification like Logarithmized X? Is it possible to enforce the log-transformation?

ACTWT = sc.read_10x_mtx(path='C:/Users/Park_Lab/Documents/Tumor/filtered_feature_bc_matrix/', var_names='gene_symbols')    # load 10X matrix
ACTWT.var_names_make_unique()
ACTWT

ldata = scv.read(filename='C:/Users/Park_Lab/Documents/Tumor.loom')
adata = scv.utils.merge(ACTWT, ldata)    # paste unspliced and spliced information into it

scv.pp.filter_genes(adata, min_shared_counts=20)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)
scv.pp.log1p(adata)
Filtered out 22229 genes that are detected 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.

Thanks! Best, YJ

WeilerP commented 2 years ago

Yes, you can run the steps individually. You don't get a messagt because the logging is happening in the filter_and_normalize function. Alternatively, you could also run scv.pp.filter_and_normalize and then call scv.pp.log1p afterwards.

weir12 commented 1 year ago

@WeilerP Hello, Plz allow me ask a question: why do we need to extract the count matrix (umi count) from the cellranger output folder and combine the spliced/unspliced information of loom files. In my understanding, the loom file generated by the velocyto output provides all the information needed for scvelo. Is the master matrix X of h5ad produced by velocyto the original umi count? I believe that the raw UMI count matrix filtered_feature_bc_matrix produced by Cell Ranger counts should be identical to the matrix layer in the loom output by velocyto or the .X of the h5ad file. A side note: The master matrix X of anndata that scvelo finally outputs seems to be the count matrix (normalized and logarithmic), while the RNA velocity related information is kept in layers

If my conceptual understanding is incorrect, please point it out for me. Thanks! Best, Weir

weir12 commented 1 year ago
import scanpy as sc
import scvelo as scv
adata = scv.read("/cluster/home/velocyto/result.h5ad")
adata = adata.transpose()
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata,n_jobs=64)

In my processing pipeline, velocyto acts as an upstream tool for scvelo, and the resulting h5ad file is utilized by scvelo to calculate the RNA velocity of each gene for each cell.I don't know if it's reasonable to do that