single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

definition of REF and ALT in mode 2 #28

Closed bobermayer closed 2 years ago

bobermayer commented 2 years ago

Hi,

I noticed that in mode 2 (with --genotype) the definition of REF and ALT is not what I expected. It looks like REF is the major allele and ALT the minor allele, so that the overall allele fraction (the ratio of the row sums of AD to the row sums of DP) is always < 0.5. However, neither is necessarily identical to the actual (genomic) reference. This can cause confusion downstream, e.g., when comparing with lists of SNPs from other sources (where I think REF is always the reference allele). Is this the intended behavior, and can it be changed? I'm using cellsnp-lite v1.2.0.

thanks!

hxj5 commented 2 years ago

Hi, thanks for your feedback. As mentioned in the cellsnp-lite paper, mode 2 takes the allele with the highest count as REF and the one with the second highest count as ALT, because it has little input information about the actual (genomic) reference. This is different from mode 1, which uses the REF and ALT alleles specified in the input VCF. Yes, it could cause confusion in some cases, as you mentioned. For now, in these cases, you have to write your own scripts to fix the reference in both the output matrices and the output VCF (if it exists). Sorry for that.

This issue was already on our TODO list, and we are trying to fix it in the next release (e.g., by adding a command-line option for a FASTA file so that we can extract the actual reference for mode 2).

bobermayer commented 2 years ago

Hi, thanks for the quick reply and the explanation, makes sense!

hxj5 commented 2 years ago

Update: since v1.2.2, we have the -f or --refseq option for this issue.