swiftbiosciences / primerclip

Swift Accel-Amplicon primer trimming tool for fast alignment-based primer trimming
Other
10 stars 6 forks source link

Primerclip removes almost everything #3

Open chapplec opened 5 years ago

chapplec commented 5 years ago

I am trying to implement primerclip as part of my pipeline. However, when testing it on a small sam file, it removes all but two sequences:

## The bam file was generated using bwa-mem
$ samtools view -h sample.bam > sample.sam
$ grep -c '^[^@]' sample.sam
18308
$ primerclip Acel-Amplicon_56g_masterfile.txt sample.sam sample.clipped.sam 
all 263 master records parsed successfully.

primer trimming complete.
$ grep -c '^[^@]' sample.clipped.sam 
2
$ tail -n3 sample.clipped.sam 
@PG     ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 48 -R @RG\tID:short_L001_001\tSM:primerTest\tPL:Illumina\tPU:L001 /genomes/hg19/hg19.fa reads/short_L001_R1_001.fastq.gz reads/short_L001_R2_001.fastq.gz
M01296:233:000000000-ACMBP:1:1110:16093:15306   83      chr1    4409915 60      120M    =       4409916 -119    GTTTATTGCAGCACTGTTCACAATAGCAAAGACTTGGAACCAACCCAAATGCCCATCAATGATAGACTGGATAAAGAAAATATGGCACATATACACCATGGAATACTATGCAGCCATAAA   HHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGHHHHHHHGHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGFHHHHGGGGGGGGGGBBFBFFFBBBB   NM:i:0  MD:Z:120        MC:Z:120M       AS:i:120        XS:i:110   RG:Z:short_L001_001     CO:Z:primer_trimmed
M01296:233:000000000-ACMBP:1:1110:16093:15306   163     chr1    4409916 60      120M    =       4409915 119     TTTATTGCAGCACTGTTCACAATAGCAAAGACTTGGAACCAACCCAAATGCCCATCAATGATAGACTGGATAAAGAAAATATGGCACATATACACCATGGAATACTATGCAGCCATAAAA   ABBCFFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHGHHGGGGHGHHHHHHHHIHHHHHHHHHHHHHGHHHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHHHHHHGH   NM:i:0  MD:Z:120        MC:Z:120M       AS:i:120        XS:i:111   RG:Z:short_L001_001     CO:Z:primer_trimmed

I get the same behavior using both the official bwa mem and the Sentieon implementation of the same tool (https://www.sentieon.com/products/) and on two different samples. However, neither sample was actually sequenced using the Swift kit whose primer file I am testing. Perhaps this is the source of the problem, but that would seem strange. Why remove sequences if they don't match the primers? More importantly, why are sequences being removed at all? I was expecting the primers to be soft clipped, or hard clipped, but not the removal of entire reads!

What am I doing wrong here?

irishjonathan commented 5 years ago

Hi Chapplec,

Did you find that name sorting your input SAM resolved your issue with truncated output?

chapplec commented 5 years ago

Hi, and sorry it took me so long to answer, I was working on something else. So, the input file I was using was created from an already sorted bam file, so I had been assuming that meant it was sorted. However, I tried again after sorting by name and this time I get an empty file:

$ samtools sort -n  sample.sam > sample.sorted.sam
$ grep -c '^[^@]' sample.sorted.sam 
2767369
$ primerclip Acel-Amplicon_56g_masterfile.txt sample.sorted.sam sample.sorted.clipped.sam 
all 263 master records parsed successfully.

primer trimming complete.
$ wc sample.sorted.clipped.sam 
0 0 0 sample.sorted.clipped.sam
jonca79 commented 4 years ago

For me it only worked after sorting by name (samtools sort -n). That the input should be in this format should be documented somewhere!