single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

Large 10x bam files #25

Closed AnjaliC4 closed 2 years ago

AnjaliC4 commented 2 years ago

Hi, I have very large10x cellranger bams (about 60-70gb) produced using their BWA-MEM default parameters. 10x default pipeline produces bam files that include the unmapped, chimeric, and duplicate regions. However, the fragment files they produce have the deduplicted and paired fragments but Cellsnp requires a bam file. The problem is that it is taking me more than two days to run 70g bam on cellsnp-lite. Could you please recommend how could I fasten it and which flags I should include or exclude from the 10x produced bam file. My understanding is that --exclFLAG parameter has the following set to default: UNMAP,SECONDARY,QCFAIL,DUP. Should I also include --inclFLAG 2 (2 for reads mapped in proper pairs)? Please note I have non UMI single-cell data. Please let me know your thoughts. Thank you very much!

hxj5 commented 2 years ago

Yes, cellsnp-lite requires bam/sam/cram file as input. For 10x non-UMI single-cell data, cellsnp-lite would filter reads of type UNMAP, SECONDARY, QCFAIL, or DUP by default if --UMItag None is set. The reads not mapped in proper pairs would also be filtered by default if --countORPHAN is not used (i.e. probably no need to include --inclFLAG 2).

Seems there would be little improvements by changing --exclFLAG or --inclFLAG if the reads were well-flagged by upstream cellranger. But you may want to try replacing -R with -T if a SNP list was given and -R was used, as sometimes -T could be faster.

AnjaliC4 commented 2 years ago

Ok thank you very much for your suggestions. I read in the manual that -T Similar as -R, but the next position is accessed by streaming rather than indexing/jumping (like -T in samtools/bcftools mpileup). So can I assume that the results will be equivalent whether I use -R or -T? Another option I thought was to use MAPQ>30 rather than >20 - but in your experience, does setting it 20 result in more cells being demultiplexed? And finally could you please suggest how can I check if these --exclFLAG flags of cellSNP (UNMAP, SECONDARY, QCFAIL, or DUP) are indeed present in my 10x bam file? Thank you.

hxj5 commented 2 years ago

Hi, the -R and -T should give the same result.

Using MAPQ>30 or >20 is expected to give little difference for the result of demultiplexing.

samtools flagstat could be used for an overview of the number of alignments for each FLAG type. e.g., samtools flagstat <in.bam>

or use samtools view -f/-F/-G for more detailed results. e.g., samtools view -f 1024 <in.bam> | less would print all reads of DUP type (duplicates).