single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

cellsnp-lite output not working with Vireo - reporting really high read depths #33

Open dannyconrad opened 2 years ago

dannyconrad commented 2 years ago

I have a scATACseq dataset where I'm trying to use the mitochondrial reads to demultiplex two mouse strains. When I run cellSNP and cellsnp-lite in mode 2 using what I believe to be matching parameters and the same input files, both identify all of the same SNPs, but the read depths reported in cellSNP.base.vcf.gz and the sparse matrices are different; the depths reported by cellsnp-lite are ~100X greater. When I run vireo on these two different outputs, the cellSNP results seem to properly demultiplex my donors, whereas the cellsnp-lite results leave all cells unassigned. Importantly, I noticed in the donor_ids.tsv file for the cellsnp-lite vireo run, that all cells are reported to have the same results:

cell    donor_id    prob_max    prob_doublet    n_vars  best_singlet    best_doublet    doublet_logLikRatio
AAACGAAAGGTCGTTT-1  unassigned  4.76e-01    4.79e-02    370 donor0  donor0,donor1   0.000

I guess I'm not sure if this is a bug with cellsnp-lite or if I did something drastically different running the two programs without realizing it. Any ideas?

Here are the two batches of code I ran and the relevant console outputs:

cellsnp-lite

cellsnp-lite \
-s ./chrM_CB.bam \
-b ./barcodes_mt_revcomp.tsv \
-O ./cellsnp-lite_DeNovo \
-p 40 \
--minMAF 0.1 \
--UMItag None \
--chrom M
--exclFLAG UNMAP,SECONDARY,QCFAIL
[I::main] start time: 2021-11-24 11:49:27
[W::check_args] Max depth set to maximum value (2147483647)
[I::main] mode 2a: pileup 1 whole chromosomes in 4787 single cells.
[W::csp_pileup_core] Combined max depth is above 1M. Potential memory hog!
[I::csp_pileup_core][Thread-0] processing chrom M ...
[I::csp_pileup_core][Thread-0] has pileup-ed in total 370 SNPs for chrom M
[I::main] All Done!
[I::main] end time: 2021-11-24 13:12:30
[I::main] time spent: 4983 seconds.

less cellsnp-lite_DeNovo/cellSNP.base.vcf.gz | head 
##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
M   55  .   T   G   .   PASS    AD=146211;DP=489229;OTH=247
M   165 .   C   A   .   PASS    AD=142350;DP=607292;OTH=268
M   348 .   T   C   .   PASS    AD=103804;DP=350149;OTH=251

vireo \
-c ./cellsnp-lite_DeNovo/ \
-N 2 \
-o ./vireo/cellsnp-lite_DeNovo \
-p 40
[vireo] Loading cell folder ...
[vireo] Demultiplex 4787 cells to 2 donors with 370 variants.
[vireo] lower bound ranges [-114928564.4, -114928564.4, -114928564.4]
[vireo] allelic rate mean and concentrations:
[[0.01  0.284 0.99 ]]
[[5.00000000e+01 2.39958576e+08 5.00000000e+01]]
[vireo] donor size before removing doublets:
donor0  donor1
2394    2394
[vireo] final donor size:
unassigned
4787
[vireo] All done: 0 min 16.0 sec

cellSNP

cellSNP \
-s ./chrM_CB.bam \
-b ./barcodes_mt_revcomp.tsv \
-O ./cellSNP_DeNovo \
-p 40 \
--minMAF=0.1 \
--UMItag=None \
--chrom=M \
--maxFLAG=4096
[cellSNP] mode 2: pileup 1 whole chromosomes in 4787 single cells.
[cellSNP] Whole genome pileupped, now merging all variants ...
[cellSNP] 408 lines in final vcf file
[cellSNP] All done: 64 min 43.5 sec

less cellSNP_DeNovo/cellSNP.base.vcf.gz | head 
##fileformat=VCFv4.2
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chrM    55  .   T   G   .   PASS    AD=1924;DP=7534;OTH=2
chrM    165 .   C   A   .   PASS    AD=2100;DP=7740;OTH=0
chrM    348 .   T   C   .   PASS    AD=2749;DP=7402;OTH=2

vireo \
-c ./cellSNP_DeNovo/cellSNP.cells.vcf.gz \
-N 2 \
-o ./vireo/cellSNP_DeNovo \
-p 40
[vireo] Loading cell VCF file ...
[vireo] Demultiplex 4787 cells to 2 donors with 370 variants.
[vireo] lower bound ranges [-1585667.0, -1585667.0, -430338.7]
[vireo] allelic rate mean and concentrations:
[[0.038 0.907 0.938]]
[[1857003.8  450325.5  418458.7]]
[vireo] donor size before removing doublets:
donor0  donor1
1735    3052
[vireo] final donor size:
donor0  donor1  doublet unassigned
1581    2891    286 29
[vireo] All done: 0 min 15.2 sec
hxj5 commented 2 years ago

Hi, thanks for your feedback. Could you share the version of cellSNP and cellsnp-lite?

The reason why the read counts are different: cellSNP (actually the dependent pysam) has a default limitation that the max_depth (i.e., max pileup-ed read count) is 8000, so you may check that every DP<=8000 in the cellSNP output. But cellsnp-lite does not have this max_depth limitation, it will pileup as many reads as possible. (we are trying to add a --maxPileup option with similar function as max_depth, for the next release of cellsnp-lite)

The reason why cellsnp-lite+vireo does not work well: The large read counts in cellsnp-lite output makes it more likely for vireo to reach local optima so that the parameters of two donors become the same and hence vireo cannot assign the cells to certain donor.

Besides, vireo is designed for nuclear SNVs. For mito SNVs, you may want to try this tutorial, which was used by MQuad. Note that the duplicate reads should probably be removed beforehand, as there are not UMIs in your case.

Xianjie