single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

In a given SNP region, what are the criteria being used to categorize a cell as WT(0/0), heterozygote(1/0), and homozygote(1/1)? #109

Closed jiehuichen closed 7 months ago

jiehuichen commented 7 months ago

Dear Xianjie,

May I ask a naive question?

In a given SNP region, what are the criteria being used to categorize a cell as WT(0/0), heterozygote(1/0), and homozygote(1/1)?

Are they categorized by the count/percentage of REF reads and count/percentage of ALT reads in a cell?

BTW, when I added "--minMAF 0.1 --minCOUNT 20", I found that only 3 of the 10 SNP regions could be called in these single cells. All 10 SNP regions can be called without the filtering. I'm a little confused.

Many thanks.

hxj5 commented 7 months ago

Hi, for genotyping in single cells, cellsnp-lite first needs to know the REF and ALT alleles. These two alleles can be either specified by the user (-R option in mode 1) or inferred de novo from the data (in mode 2). After that, cellsnp-lite performs genotyping, selecting the genotype with the maximum likelihood in each single cell, using the error model presented in Table 1 of Jun et al., 2012.

The two options --minMAF and --minCOUNT are used for filtering SNPs in a pseudo-bulk manner, not in each single cell. The corresponding "MAF and COUNT" values are calculated based on aggregated read/UMI counts of all cells.
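To illustrate what filtering "in a pseudo-bulk manner" might look like, here is a minimal Python sketch that sums per-cell REF and ALT counts before applying the two thresholds. The function name, the MAF definition (minor allele count divided by the aggregated REF+ALT count), and the use of `>=` comparisons are assumptions made for illustration; they are not taken from the cellsnp-lite source.

```python
def passes_pseudobulk_filter(ref_counts, alt_counts, min_maf=0.1, min_count=20):
    """Return True if a SNP passes thresholds computed over ALL cells.

    ref_counts / alt_counts: per-cell read (or UMI) counts for one SNP.
    Assumption: MAF = min(REF, ALT) / (REF + ALT) on the aggregated counts.
    """
    total_ref = sum(ref_counts)
    total_alt = sum(alt_counts)
    total = total_ref + total_alt

    # The count threshold applies to the aggregate, not to each cell.
    if total < min_count:
        return False

    maf = min(total_ref, total_alt) / total
    return maf >= min_maf


# Three cells with shallow coverage: callable per cell, but only 9 reads in total.
print(passes_pseudobulk_filter([4, 1, 0], [1, 0, 3]))
# Deeper aggregate coverage with a reasonably frequent minor allele.
print(passes_pseudobulk_filter([10, 10, 5], [3, 2, 0]))
```

Under these assumptions, a SNP that can be genotyped in individual cells may still be dropped when its aggregated count is below 20 or its aggregated minor allele frequency is below 0.1, which would be consistent with fewer SNPs surviving the filter.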

jiehuichen commented 7 months ago

Thanks for your help.

I picked several examples from "cellSNP.cells.vcf". PS: I specified the REF and ALT alleles on the command line.

"1/0:1:5:1:31,29,147:4,0,1,1,0". It means: ALT allele depth=1, REF depth=5-1=4, depth of all alleles other than REF and ALT=1.

"0/0:0:1:0:0,3,39:0,0,1,0,0". It means: ALT allele depth=0, REF depth=1-0=1, depth of all alleles other than REF and ALT=0.

"1/1:3:3:0:116,9,0:0,0,0,3,0". It means: ALT allele depth=3, REF depth=3-3=0, depth of all alleles other than REF and ALT=0.
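The reading of these sample fields can be written down as a small Python sketch. It assumes the colon-separated layout is GT:AD:DP:OTH:PL:ALL, with AD as the ALT depth, DP as the REF+ALT depth, OTH as the depth of other alleles, PL as three genotype scores, and ALL as five per-base counts; the function name and the returned keys are hypothetical and chosen only for this example.

```python
def parse_sample_field(field):
    """Split one per-cell VCF sample entry into named values.

    Assumed layout: GT:AD:DP:OTH:PL:ALL (see the lead-in above).
    """
    gt, ad, dp, oth, pl, all_counts = field.split(":")
    alt_depth = int(ad)
    total_depth = int(dp)
    return {
        "genotype": gt,
        "alt_depth": alt_depth,
        # REF depth is derived as DP minus AD, as in the examples above.
        "ref_depth": total_depth - alt_depth,
        "other_depth": int(oth),
        "pl": [int(x) for x in pl.split(",")],
        "all_counts": [int(x) for x in all_counts.split(",")],
    }


for entry in ("1/0:1:5:1:31,29,147:4,0,1,1,0",
              "0/0:0:1:0:0,3,39:0,0,1,0,0",
              "1/1:3:3:0:116,9,0:0,0,0,3,0"):
    print(parse_sample_field(entry))
```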

It seems the genotypes could be identified from the ALT and REF depths whenever either allele has counts. I'm a little confused about the maximum likelihood and the error model: how is the maximum likelihood defined, and how is the error model used?

Thank you for your patience in answering these naive questions.

Best,

hxj5 commented 7 months ago

Hi, the depths of the REF and ALT alleles can indeed be used to infer the genotype. However, the accuracy of inference could be low this way, due to the high noise in sequencing data (e.g., sequencing error). For instance, in your first example, if the one read supporting the ALT allele is an artifact arising from sequencing error, then the true genotype could be 0/0.

To account for sequencing error in genotyping, the error model presented in Table 1 of Jun et al., 2012 uses a parameter $e$ indicating the occurrence of a "Base Calling Error Event". Likelihood can be simply treated as probability. For each SNP, the likelihoods of the three genotypes ("0/0", "1/0", "1/1") are calculated by aggregating the information provided by all bases/alleles (from the pileup of all supporting reads/UMIs) and their corresponding sequencing qualities (reflecting the probability of sequencing error), modified from Equation 1 in Jun et al., 2012. The final reported genotype is the one with the maximum likelihood.
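As a rough illustration of the idea, the Python sketch below computes a log-likelihood for each of the three genotypes from the piled-up bases and their Phred qualities, then picks the maximum. It uses a common simplified per-base error model (an erroneous call is assumed to be equally likely to be any of the three other bases, and the two alleles of a genotype are sampled with equal probability). This is an assumption for illustration and not necessarily the exact formula cellsnp-lite implements; the function names are hypothetical.

```python
import math

# Alleles carried by each genotype: 0 = REF, 1 = ALT.
GENOTYPES = {"0/0": (0, 0), "1/0": (0, 1), "1/1": (1, 1)}


def base_likelihood(base, genotype, ref, alt, err):
    """P(observed base | genotype, error rate) under the simplified model."""
    alleles = [ref if a == 0 else alt for a in GENOTYPES[genotype]]
    # Each allele is sampled with probability 1/2; a matching base is seen
    # with probability 1 - err, any specific other base with err / 3.
    return sum((1.0 - err) if base == a else err / 3.0 for a in alleles) / 2.0


def call_genotype(bases, quals, ref, alt):
    """Return (best genotype, log10-likelihood per genotype).

    bases: observed bases, one per read/UMI; quals: Phred scores (assumed > 0).
    """
    log_lik = {}
    for gt in GENOTYPES:
        log_lik[gt] = sum(
            math.log10(base_likelihood(b, gt, ref, alt, 10 ** (-q / 10)))
            for b, q in zip(bases, quals)
        )
    best = max(log_lik, key=log_lik.get)
    return best, log_lik


# Four REF reads and one ALT read, all high quality.
print(call_genotype("AAAAG", [30, 30, 30, 30, 30], "A", "G")[0])
# Same reads, but the single ALT base has a very low quality score.
print(call_genotype("AAAAG", [30, 30, 30, 30, 3], "A", "G")[0])
```

In this sketch the first call favors "1/0", while the second favors "0/0" because the low-quality ALT base is more plausibly a sequencing error. That is the kind of case where raw depth counts alone could be misleading.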

jiehuichen commented 7 months ago

Got you, many thanks.