Data format of the second tutorial

kaizhang / SnapATAC2

Single-cell epigenomics analysis tools

https://kzhang.org/SnapATAC2/

223 stars 24 forks source link

Data format of the second tutorial #138

Closed ChuanXu1 closed 1 year ago

ChuanXu1 commented 1 year ago

Dear SnapATAC2 team,

In this tutorial https://kzhang.org/SnapATAC2/tutorials/annotation.html, I think the query is not in raw count. How can it be concatenated together with reference for scVI integration? Anything I miss?

kaizhang commented 1 year ago

I don't understand you question. Does this work for you?

import scanpy as sc
reference = sc.read(snap.datasets.pbmc10k_multiome())
query = sc.read(snap.datasets.pbmc5k(type='gene'))
data = reference.concatenate(query, batch_categories=["reference", "query"])

ChuanXu1 commented 1 year ago

@kaizhang, according to the command of the tutorial (see below, counts and nb), data is a raw count rather than normalized expression

scvi.model.SCVI.setup_anndata(data, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(
    data,
    n_layers=2,
    n_latent=30,
    gene_likelihood="nb",
    dispersion="gene-batch",
)

But seems query is not a raw count matrix

import snapatac2 as snap
query = sc.read(snap.datasets.pbmc5k(type='gene'))
query.X.data
>>>>array([0.8492694, 0.8492694, 2.6657953, ..., 2.2745466, 1.1571571, 1.1571571], dtype=float32)
query.X.data.max()
>>>>7.6579537
query.X.expm1().sum(axis=1)
>>>>matrix([[38354.004],
        [38354.004],
        [38354.004],
        ...,
        [38354.   ],
        [38353.996],
        [38354.   ]], dtype=float32)

kaizhang commented 1 year ago

"query" is from scATAC-seq data. That's why we don't have a raw gene count matrix. The 'X' in "query" is a cell-by-gene activity matrix estimated from gene body accessibility.

ChuanXu1 commented 1 year ago

Then how can this non-count matrix be used for scVI integration with a negative binomial distribution? Sorry if misunderstand anything

kaizhang commented 1 year ago

You are right. Strictly speaking, the negative binomial distribution should be used only for modeling count data. But I guess it can still be useful for approximating the distribution of continuous data. You may also want to consult with the authors of the scvi-tools, as they used a similar approach when integrating scATAC-seq and scRNA-seq data: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/peakvi_in_R.html#integrating-with-scrna-seq-data-scanvi. They may have a better answer to your question.

ChuanXu1 commented 1 year ago

alright, thx

wangjiawen2013 commented 1 year ago

It is scvi.model.PEAKVI.setup_anndata on https://docs.scvi-tools.org/en/stable/tutorials/notebooks/atac/PeakVI.html, not scvi.model.SCVI.setup_anndata. Why snapatac2 uses scvi instead of PeakVI ?

wangjiawen2013 commented 1 year ago

You are right. Strictly speaking, the negative binomial distribution should be used only for modeling count data. But I guess it can still be useful for approximating the distribution of continuous data. You may also want to consult with the authors of the scvi-tools, as they used a similar approach when integrating scATAC-seq and scRNA-seq data: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/peakvi_in_R.html#integrating-with-scrna-seq-data-scanvi. They may have a better answer to your question.

It's still count data after running snap.pp.make_gene_matrix(data, snap.genome.hg38). I think this data should be used for scvi-tools. However, it became non-count matrix after running import scanpy as sc sc.pp.filter_genes(gene_matrix, min_cells=5) sc.pp.normalize_total(gene_matrix) sc.pp.log1p(gene_matrix)

So it's better to revise snapatac2 tutorial to make it clear. Now the tutorial uses the logarithmized data, which makes users confused.