guanchangge / mosaik-aligner

Automatically exported from code.google.com/p/mosaik-aligner

BAM/SAM output clipping is not identified in Cigar string #69

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Prepare reference and data:
   MosaikBuild -fr pNL4_3.fasta -oa pNL4_3.ref.dat
   MosaikBuild -fr pNL4_3_reads.fa -fq pNL4_3_reads.qual -out test.dat -tn 500 -st 454

2. Align:
   MosaikAligner -in test.dat -out test_aligned.dat -ia pNL4_3.ref.dat -hs 15 -mm 0.05 -act 55

3. Generate BAM and/or SAM:
   MosaikSort -in test_aligned.dat -out test_aligned_sorted.dat
   MosaikText -in test_aligned_sorted.dat -sam test_aligned_sorted.sam

What is the expected output? What do you see instead?
Expected: a CIGAR string with hard (H) or soft (S) clipping indicated where the 
read has been trimmed.
Actual: no clipping is indicated in the CIGAR string.

What version of the product are you using? On what operating system?
Mosaik 1.1.0013 (or 1.0.1388) Linux

Please provide any additional information below.

The BAM/SAM output contains clipped aligned reads, but the clipping is not 
identified in the CIGAR string.

Here is an example (from the source I previously sent):

The read is defined in the input pNL4_3_reads.fa file as follows:
> >F3GSHBH01DJJYC
> AATGGCCAATTGACAGACAATGGCCATTGACAGACAATGGCCATTG

When it comes out of Mosaik, it has lost 1 leading base and 12 trailing 
bases; however, these aren't indicated in the CIGAR string:

> F3GSHBH01DJJYC    0    pNL4_3_Assembly    1810    14    6M1D27M    *    *    *    ATGGCCAATTGACAGACAATGGCCATTGACAGA    <<<<884444<<<676<==>88889<<?>>=>?    RG:Z:ZD5D0JISOWI    NM:i:6

Lining up the input and output strings shows this clearly:
AATGGCCAATTGACAGACAATGGCCATTGACAGACAATGGCCATTG    <- from input fasta
 ATGGCCAATTGACAGACAATGGCCATTGACAGA                <- from output sam
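
The same comparison can be done programmatically. Below is a minimal Python sketch that counts the missing leading and trailing bases. It assumes the SAM sequence appears as an exact forward-strand substring of the original read (true for this record, which has flag 0); the function name `trimmed_bases` is only illustrative and is not part of Mosaik.

```python
def trimmed_bases(original, aligned):
    """Return (leading, trailing) counts of bases missing from `aligned`.

    Assumes `aligned` is an exact substring of `original`; a reverse-strand
    read would need to be reverse-complemented first.
    """
    start = original.find(aligned)
    if start < 0:
        raise ValueError("aligned sequence not found in original read")
    trailing = len(original) - start - len(aligned)
    return start, trailing


# Sequences copied from the report above.
original = "AATGGCCAATTGACAGACAATGGCCATTGACAGACAATGGCCATTG"  # input fasta
aligned = "ATGGCCAATTGACAGACAATGGCCATTGACAGA"                # output sam

print(trimmed_bases(original, aligned))  # (1, 12)
```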

The cigar string needs something added (presumably a hard-clip H operator) at 
the beginning and end to indicate the missing bases.
In other words, it should be 1H6M1D27M12H
It would also be desirable for the complete read sequence to be in the output, 
and the soft clipping S operator used.
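
To make the requested output concrete, here is a rough Python sketch that builds both the hard-clipped and soft-clipped CIGAR strings for this read. It is not Mosaik code; the helper names `query_length` and `add_clipping` are made up for illustration, and it again assumes the aligned sequence is an exact forward-strand substring of the original read. It relies on the SAM convention that the lengths of the M, I, S, = and X operations sum to the length of SEQ.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # operations that consume bases of SEQ


def query_length(cigar):
    """Number of SEQ bases the CIGAR string accounts for."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar)
               if op in QUERY_CONSUMING)


def add_clipping(cigar, lead, trail, op="H"):
    """Wrap `cigar` with clipping operations; `op` is 'H' or 'S'."""
    if op not in ("H", "S"):
        raise ValueError("clipping operator must be 'H' or 'S'")
    prefix = f"{lead}{op}" if lead else ""
    suffix = f"{trail}{op}" if trail else ""
    return prefix + cigar + suffix


# Values copied from the report above.
original = "AATGGCCAATTGACAGACAATGGCCATTGACAGACAATGGCCATTG"
aligned = "ATGGCCAATTGACAGACAATGGCCATTGACAGA"
cigar = "6M1D27M"

# The reported CIGAR only covers the trimmed sequence (33 of 46 bases).
assert query_length(cigar) == len(aligned)

lead = original.find(aligned)
trail = len(original) - lead - len(aligned)

hard = add_clipping(cigar, lead, trail, "H")  # SEQ stays trimmed
soft = add_clipping(cigar, lead, trail, "S")  # SEQ would be the full read

print(hard)  # 1H6M1D27M12H
print(soft)  # 1S6M1D27M12S
assert query_length(soft) == len(original)
```

With hard clipping the SEQ and QUAL fields stay as they are now; with soft clipping they would need to carry all 46 bases and qualities of the original read.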

This was using the MosaikAligner command:
    MosaikAligner -in test.dat -out test_aligned.dat -ia ref.dat -act 15 -mm 500 -mmp 0.25 -mmal -minp 0.25 -gop 15 -hgop 4 -gep 6.66 -m unique 

Original issue reported on code.google.com by martha.b...@gmail.com on 7 Oct 2010 at 9:30

GoogleCodeExporter commented 8 years ago

Original comment by WanPing....@gmail.com on 27 Oct 2010 at 3:39

GoogleCodeExporter commented 8 years ago
This issue doesn't seem to be resolved, and recent distributions don't even 
seem to include MosaikText.  Is BAM/SAM conversion no longer supported?  If so, 
it might be a good idea to remove "Robust support for the SAM & BAM alignment 
file formats." from the website.

Original comment by Delphi....@gmail.com on 1 Feb 2011 at 4:22

GoogleCodeExporter commented 8 years ago
Has any progress been made on this in either direction?

Original comment by dan.kort...@adelaide.edu.au on 1 Apr 2012 at 10:02