gt1 / biobambam2

Tools for early stage alignment file processing
Other
93 stars 17 forks source link

bamclipreinsert error when the sequence contains a single base with a quality of 9 #39

Open srl147 opened 7 years ago

srl147 commented 7 years ago

If after clipping the sequence in a bam file has a single base with a quality of 9, which corresponds to a '*' in the sam format, bamclipreinsert doesn't restore the quality values properly

I've created two bam files, containing a single cluster, which illustrate the problem. input2.bam in which the clipped forward read has a single base with a quality value of 9 and fixed2.bam where the quality value has been changed to 10.

Compare bamclipreinsert < input2.bam | samtools view bamclipreinsert < fixed2.bam | samtools view

We actually noticed this error because bam12auxmerge failed with the error 'Invalid quality value', when we tried re-inserting the clipped bases. The sam representation of the resulting bam file is somewhat different to that produced by bamclipreinsert but I assume the cause is the same.

srl147 commented 7 years ago

the links didn't insert correctly ftp://ngs.sanger.ac.uk/scratch/project/team117/srl/input2.bam ftp://ngs.sanger.ac.uk/scratch/project/team117/srl/fixed2.bam

whitwham commented 7 years ago

So, by the SAM spec, it is kind of doing the right thing. A single '*' in the quality value stands for a lack of quality data. ClipAdaptor treats sequence and quality as text, so after clipping the single '*' gets interpreted as a lack of quality data and gets stored as a 255 in BAM format.

When ClipReinsert pulls the data out it reads the 255, adds 33 to get the right ascii value and so (because it is effectively a char type) becomes a space (ascii 32).

A space is not a valid quality value and causes bam12auxmerge to fail. It is an interesting edge case.

srl147 commented 7 years ago

After some discussion we suggest the following as a simple fix. The code which clips adapter already checks for reads that are 100% adapter, in which case it will not clip the read. Can this check be extended to also skip reads that would contain only one base after clipping.

whitwham commented 7 years ago

That looks like the simplest fix. I will get that done.