single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

miscount alleles single-cell ATAC-seq #95

Closed raul-w closed 1 year ago

raul-w commented 1 year ago

Dear developers of cellsnp-lite,

I ran cellsnp-lite in Mode 1a on a single-cell ATAC-seq dataset aligned with CellRanger-ATAC, using a set of pre-defined heterozygous SNPs produced by Deepvariant. However, the allele counts reported by cellsnp-lite do not correspond to those observed when looking at the alignments in IGV by eye. At position B1_RagTag:26190 for example, cellsnp-lite reports a DP of 31 and AD of 6, while I observe a DP of 18 and AD of 3 when looking at this position in IGV (duplicate read alignments hidden from view):

B1_RagTag_26190_non_duplicate_alignments view).

Do you know what is going wrong here? Below is a link to a small test case in which I observed the issue described above. It contains the input files, the used command (test_het_snps_cellsnp_lite_snps_only_B1_RagTag_1_150000_command_and_version.txt), the stdout/stderr (test_het_snps_cellsnp_lite_snps_only_B1_RagTag_1_150000.log), and the resulting output directory (test_cellsnp_lite_snps_only_B1_RagTag_1_150000).

https://websafe.mpipz.mpg.de/d/vaT4VqayW5/test_case_cell_snp_lite_miscounting_alleles.tar.gz

Kind regards,

Raúl

hxj5 commented 1 year ago

Hi Raúl,

Thanks for the detailed information. Unfortunatelly, I can not reproduce the IGV figure you shared. It seems the chromosome names in the BAM file are not commonly used. When I loaded the BAM file into IGV, it complains those chromosomes in BAM do not match any chromosomes in the reference genome (here I used hg38). Do you know how to fix it? Thanks.

I also have a quick check of the number of fetched reads on position B1_RagTag:26190. It turns out that there are 32 reads left after I filter reads with FLAG UNMAP, SECONDARY, QCFAIL, DUP and MAPQ<20. The command I used is: samtools view -F 1796 -q 20 test_cellsnp_lite_B1_RagTag_1_150000.bam B1_RagTag:26190-26190 | wc -l.

According to the log file you shared, main parameters used by cellsnp-lite for read filtering are: min_len=30, min_mapq=20, rflag_filter=1796, which means the reads with aligned length<30nt, MAPQ<20, or FLAG with any of "UNMAP, SECONDARY, QCFAIL, DUP" would be filtered. You may want to double check whether the IGV settings matched those read-filtering parameters.

Best regards, Xianjie

raul-w commented 1 year ago

Hi Xianjie,

Sorry, I should have included the FASTA file with the B1_RagTag sequence, as this is not a human dataset. You can get it through the following link: https://websafe.mpipz.mpg.de/d/xzDkwvxffS/B1_RagTag.fa

Thanks for the suggestion about the flags, I will check them out on Monday. it might be that cellranger-atac uses different flags to mark duplicates than the ones expected by cellsnp-lite,

Kind regards,

Raúl

hxj5 commented 1 year ago

Hi, I have successfully loaded the BAM into IGV using the fasta file you shared. After modifying the parameters in IGV Preferences -> Alignments to match read-filtering parameters used by cellsnp (especially setting "Mapping quality threshold" to 20 from my side; Figure 1), the number of reads in IGV seem close to AD & DP produced by cellsnp (Figure 2).

Figure 1

image

Figure 2

image (1)

raul-w commented 1 year ago

Hi Xianjie,

Changing the IGV settings (specifically changing the mapping threshold from 30 to 20) indeed made the numbers almost match up. The small difference that remained could be explained by the fact that cellsnp-lite does not count reads that lack the CB tag, while IGV still displays those. E.g. for position B1_RagTag:26190:

samtools view -F 1796 -q 20 test_cellsnp_lite_B1_RagTag_1_150000.bam B1_RagTag:26190-26190 | wc -l

32 reads (shown in IGV)

samtools view -F 1796 -q 20 test_cellsnp_lite_B1_RagTag_1_150000.bam B1_RagTag:26190-26190 | grep "CB:Z.*-1" | wc -l

31 reads (DP reported by cellsnp-lite)

I confirmed that the same thing happens at two other positions which show different read counts between IGV and cellsnp-lite.

So this issue has been solved, thanks a lot for your help!

Kind regards,

Raúl