single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

UMI Collapsing #121

Closed grasshoffm closed 3 months ago

grasshoffm commented 3 months ago

Hello to the cellsnp-lite team,

I wanted to ask how cellSNP-lite performs UMI collapsing.

Best regards,

Martin

hxj5 commented 3 months ago

Hi Martin,

Thanks for the question. First, cellsnp-lite relies on the UMI collapsing output of upstream tools (e.g., CellRanger corrects sequencing errors within UMI sequences prior to UMI counting). Second, for SNP pileup within one UMI (i.e., a group of reads from the same source RNA molecule), cellsnp-lite currently uses the allele extracted from the first read as the consensus allele. The strategy is simple but effective in practice, thanks to technical advances that have reduced sequencing errors, though it could be improved by considering the bases and corresponding qualities from all reads.

Best, Xianjie
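
For readers following the thread, here is a minimal sketch contrasting the two strategies described above: taking the allele of the first read in each UMI group, versus a consensus that weighs the base qualities of all reads. It is not cellsnp-lite's actual code. The `ReadObs` record, both function names, and the example values are hypothetical, and the reads are assumed to be already reduced to one (cell, UMI, allele, base quality) observation at a single SNP.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadObs(NamedTuple):
    """Hypothetical record: one read's observation at a single SNP."""
    cell: str    # cell barcode
    umi: str     # UMI, assumed already corrected by the upstream tool
    allele: str  # base observed at the SNP position
    qual: int    # Phred base quality of that base


def first_read_consensus(reads):
    """Use the allele of the first read seen in each (cell, UMI) group."""
    consensus = {}
    for r in reads:
        # setdefault keeps the first allele and ignores later reads.
        consensus.setdefault((r.cell, r.umi), r.allele)
    return consensus


def quality_weighted_consensus(reads):
    """Pick, per (cell, UMI) group, the allele with the highest summed quality."""
    scores = defaultdict(lambda: defaultdict(int))
    for r in reads:
        scores[(r.cell, r.umi)][r.allele] += r.qual
    return {
        key: max(by_allele, key=by_allele.get)
        for key, by_allele in scores.items()
    }


# Made-up example: the first read of UMI1 carries a low-quality 'T'.
reads = [
    ReadObs("cellA", "UMI1", "T", 8),
    ReadObs("cellA", "UMI1", "C", 37),
    ReadObs("cellA", "UMI1", "C", 35),
    ReadObs("cellA", "UMI2", "C", 30),
]

first = first_read_consensus(reads)
weighted = quality_weighted_consensus(reads)
```

In this made-up example the two strategies disagree only for UMI1, where the first read carries a low-quality base. When sequencing errors are rare, the results would usually coincide.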

grasshoffm commented 3 months ago

Hi Xianjie,

Thanks for your swift reply. That explains everything.

Best,

Martin

HenriettaHolze commented 1 month ago

@hxj5 Hi, I'm working on single-cell long-read data for which amplification of specific genes was performed. In that case, PCR errors can occur and there can be reads that are only partially amplified etc. If only the first read per transcript is considered, this could be suboptimal. We get 60 and more reads for a single transcript.
Do you have a recommendation on how to handle this? Would it make sense to first sort reads by the alignment score (AS) tag, e.g. with samtools sort -t AS (increasing order), and then reverse the order with tac?

hxj5 commented 1 month ago

Hi, thanks for the question. Cellsnp-lite was designed for short reads. It may not fit well if the PCR and sequencing error rates of your long-read data are much higher than those of short reads. As for sorting by AS, IMPO, it seems more reasonable to sort by the sequencing qualities of the individual alleles at the target SNPs, which effectively performs a UMI collapsing correction. I would suggest using pileup/genotyping tools tailored for long-read data, or short-read tools that account for UMI collapsing (e.g., vartrix, if I remember correctly).
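
To illustrate why the two orderings can disagree, here is a small sketch that selects a representative read for one UMI either by alignment score or by the base quality at the target SNP. It is an illustration only, not a feature of cellsnp-lite or samtools. The `UmiRead` record, the function names, and the numbers are hypothetical.

```python
from typing import NamedTuple


class UmiRead(NamedTuple):
    """Hypothetical summary of one read within a single UMI group."""
    name: str
    allele: str       # base observed at the target SNP
    snp_qual: int     # Phred quality of that base
    align_score: int  # value of the AS tag for the whole alignment


def pick_by_alignment_score(reads):
    """Representative read = the one with the highest AS tag."""
    return max(reads, key=lambda r: r.align_score)


def pick_by_snp_quality(reads):
    """Representative read = the one with the best base quality at the SNP."""
    return max(reads, key=lambda r: r.snp_qual)


# Made-up reads: the best-aligned read has a poor base call at the SNP.
umi_reads = [
    UmiRead("r1", "G", 9, 1850),
    UmiRead("r2", "A", 38, 1720),
    UmiRead("r3", "A", 36, 1400),
]

by_as = pick_by_alignment_score(umi_reads)
by_qual = pick_by_snp_quality(umi_reads)
```

A whole-read alignment score says little about the single base at the SNP, so the read that ranks first by AS may still carry a low-confidence allele there.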

HenriettaHolze commented 1 month ago

Thanks a lot for your recommendation!