Closed ChuanXu1 closed 1 year ago
I don't understand you question. Does this work for you?
import scanpy as sc
reference = sc.read(snap.datasets.pbmc10k_multiome())
query = sc.read(snap.datasets.pbmc5k(type='gene'))
data = reference.concatenate(query, batch_categories=["reference", "query"])
@kaizhang, according to the command of the tutorial (see below, counts
and nb
), data
is a raw count rather than normalized expression
scvi.model.SCVI.setup_anndata(data, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(
data,
n_layers=2,
n_latent=30,
gene_likelihood="nb",
dispersion="gene-batch",
)
But seems query is not a raw count matrix
import snapatac2 as snap
query = sc.read(snap.datasets.pbmc5k(type='gene'))
query.X.data
>>>>array([0.8492694, 0.8492694, 2.6657953, ..., 2.2745466, 1.1571571, 1.1571571], dtype=float32)
query.X.data.max()
>>>>7.6579537
query.X.expm1().sum(axis=1)
>>>>matrix([[38354.004],
[38354.004],
[38354.004],
...,
[38354. ],
[38353.996],
[38354. ]], dtype=float32)
"query" is from scATAC-seq data. That's why we don't have a raw gene count matrix. The 'X' in "query" is a cell-by-gene activity matrix estimated from gene body accessibility.
Then how can this non-count matrix be used for scVI integration with a negative binomial distribution? Sorry if misunderstand anything
You are right. Strictly speaking, the negative binomial distribution should be used only for modeling count data. But I guess it can still be useful for approximating the distribution of continuous data. You may also want to consult with the authors of the scvi-tools, as they used a similar approach when integrating scATAC-seq and scRNA-seq data: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/peakvi_in_R.html#integrating-with-scrna-seq-data-scanvi. They may have a better answer to your question.
alright, thx
It is scvi.model.PEAKVI.setup_anndata on https://docs.scvi-tools.org/en/stable/tutorials/notebooks/atac/PeakVI.html, not scvi.model.SCVI.setup_anndata. Why snapatac2 uses scvi instead of PeakVI ?
You are right. Strictly speaking, the negative binomial distribution should be used only for modeling count data. But I guess it can still be useful for approximating the distribution of continuous data. You may also want to consult with the authors of the scvi-tools, as they used a similar approach when integrating scATAC-seq and scRNA-seq data: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/peakvi_in_R.html#integrating-with-scrna-seq-data-scanvi. They may have a better answer to your question.
It's still count data after running snap.pp.make_gene_matrix(data, snap.genome.hg38). I think this data should be used for scvi-tools. However, it became non-count matrix after running import scanpy as sc sc.pp.filter_genes(gene_matrix, min_cells=5) sc.pp.normalize_total(gene_matrix) sc.pp.log1p(gene_matrix)
So it's better to revise snapatac2 tutorial to make it clear. Now the tutorial uses the logarithmized data, which makes users confused.
Dear SnapATAC2 team,
In this tutorial https://kzhang.org/SnapATAC2/tutorials/annotation.html, I think the
query
is not in raw count. How can it be concatenated together withreference
for scVI integration? Anything I miss?