single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

Use cellSNP on a big BAM file #111

Open RosaDeSa opened 7 months ago

RosaDeSa commented 7 months ago

Hi! I would be grateful if you can answer my question. I am trying to use the cellSNP line tool on a large bam file (1.2 T) generated by SMARTseq. I aim to count all the possible variants in a specific gene. Do you think I should use the mode 2? Can the analysis be limited to a specific gene? I'm trying this command:

cellsnp-lite -s myfile.bam -b barcodes.txt -O ./results --chrom 3 --minMAF 0.1 --minCOUNT 100 -p 20 -f chr3_genome.fa

But the vcf in output is empty.

here the header of my bam file:

A00560:334:HLKGKDSX7:1:2272:3631:22216  163 3   10001   1   2S55M93S    =   10284   326 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACGGCACTGAAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAGGGGGG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FF,FF:,F:F:,,,:FFFF  NH:i:4  HI:i:1  AS:i:96 nM:i:0  BX:Z:TGATCTGCACCGGCACTGAA   BC:Z:TGATCTGCACCGGCACTGAA   QB:Z:FFFFFFFFFFFFFFFFFFFF   QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures  IS:Z:Unassigned_NoFeatures  UX:Z:GGTGAGGGTT UB:Z:GGTGAGGGTT
A00560:334:HLKGKDSX7:2:2674:26458:12994 163 3   10001   0   26S54M70S   =   10389   490 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACTCGTTAACGGTGTAGATCTCGGTGGT  :FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF:FFFFF,:FFFFF:FFFFF:FFFFFFFFFF:FFF:::FFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFF,FFFFFFFFFFF,  NH:i:40 HI:i:1  AS:i:116    nM:i:1  BX:Z:ACAATGGACACTCGTTAACG   BC:Z:ACAATGGACACTCGTTAACG   QB:Z:FFFFFFFFFFFFFFFFFFFF   QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures  IS:Z:Unassigned_NoFeatures  UX:Z:GTGAGGGTTA UB:Z:GTGAGGGTTA
A00560:334:HLKGKDSX7:3:1412:5348:2206   163 3   10001   3   8S54M88S    =   10284   331 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCATTGCGCAATCTGTCTCTTATACACATCTGACGCTGCCGACGACTAGTAGTGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAGG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FF:FFFFFFF::FFF:FFFFFFFFFFFFF,F,FFFFFFFFFF::F,FFFFFFFFFFF:F:F:FF:FFFF,,,  NH:i:2  HI:i:1  AS:i:100    nM:i:0  BX:Z:CGCAGTTGTTCTAGTAGTGA   BC:Z:CGCAGTTGTTCTAGTAGTGA   QB:Z:FFFFFFFFFFFFFFFFFFFF   QU:Z:FFFFFFFFFF ES:Z:Unassigned_NoFeatures  IS:Z:Unassigned_NoFeatures  UX:Z:GTGAGGGTTA UB:Z:GTGAGGGTTA

I really appreciate any help you can provide. Best

hxj5 commented 7 months ago

Hi, thanks for the question.

Cellsnp-lite does not have an option for genotyping a specific region (gene) now. As your BAM is indeed large, mode 2a could be very slow and you may try mode 2b + 1a instead, i.e., calling first in a bulk manner by mode 2b followed by genotyping in mode 1a.

Alternatively, you may also run mode 1a directly if you can prepare a VCF listing every position of that gene, i.e., specify the CHROM, POS, REF as it is and set the ALT as "." in the VCF file (or set both REF and ALT as ".", and then use -f option to extract the reference allele from the FASTA file). Cellsnp-lite would assign the allele with the highest UMI/read counts except REF as the ALT.

Generally, the mode 2b + 1a is recommended. Please check the example of each mode in the manual.

Is your merged BAM file from SMART-seq3? As far as I know, SMART-seq2 will produce one BAM file per cell. The empty output could be due to the unmatched cell tag. The defualt cell tag of cellsnp-lite is "CB", which is missing in the three BAM records you shared. You may set proper cell tag with --cellTAG option, based on your experimental settings.