single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

How to check the real reads cover an SNP in a cell? #107

Closed jiehuichen closed 7 months ago

jiehuichen commented 7 months ago

Hi Xianjie,

Thanks for your great efforts on the cellsnp-lite.

Recently, I met an issue about checking the reads cover an SNP.

For "1/0:1:11:0:39,33,380:0,10,0,1,0", which means this cell is a heterozygote, how to check the real reads cover this SNP? Can we view these reads in IGV or just grab all the reads related to this cell's SNP by samtools?

Best,

hxj5 commented 7 months ago

Hi, yes, you can extract the reads by samtools and then view them in IGV.

To extract reads covering a SNP and output to a BAM file (assuming the SNP's position is chr1:100000):

samtools view -h -b  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

If you only want to extract SNP reads in specific cell (assuming cell barcode is XXX-1 and cell tag is CB):

samtools view -h -b  -d CB:XXX-1  "input_BAM"  chr1:100000  >  "output_BAM"
samtoos index "output_BAM"

Then you can load the output BAM file above into IGV to view the reads.

jiehuichen commented 7 months ago

Thanks Xianjie.

Since the bam file of my project is different from that of 10x, when I tried take codes you shared above to extract SNP reads in specific cell, there will be some warnings:

[bam_header_read] EOF marker is absent. The input is probably truncated. [bam_header_read] invalid BAM binary header (this is not a BAM file).

So I just simplified the codes as:

samtools view $input_BAM | less -S | grep "Cell_BarcodeXXX" > SNP1.txt

Then, I checked the sequence of WT and "Mut" in SNP1 region, cat SNP1.txt | grep "GCGCCTGAC" |wc -l cat SNP1.txt | grep "GCGCTTGAC" |wc -l

Compare the output with the information in "cellSNP.cells.vcf" by cellSNP-liste, e.g. "1/0:1:11:0:39,33,380:0,10,0,1,0", . I found some of them were not consistent, especially for those cells with less reads count in "cellSNP.cells.vcf".

By the way, during the SNP calling, I dit not use the parameter “--minMAF 0.1 --minCOUNT 20”.

According to your understanding, is it possible that some SNPs with low read count are unreliable/false positive?

If yes, should I always add the filtering “--minMAF 0.1 --minCOUNT 20” when using the cellsnp-lite to avoid some false positive SNPs?

Many thanks.

hxj5 commented 7 months ago

Hi, the inconsistency of read counts between samtools and cellsnp-lite is probably due to the different filtering settings of the two tools, e.g., by default, cellsnp-lite will filter some low-quality reads (please check --exclFLAG option) while samtools do not. To make the filtering settings the same, you can use -F option in samtools view.

Yes, generally speaking, adding "--minMAF 0.1 --minCOUNT 20" is recommended to reduce false positives.

jiehuichen commented 7 months ago

Got you, many thanks.