mengyao / Complete-Striped-Smith-Waterman-Library


Alignments with inconsistent CIGAR/sequence length #40

Closed insectopalo closed 2 years ago

insectopalo commented 7 years ago

When running the C program to output a SAM file,

ssw_test -r region_of_interest.fa -c -s -h 3553-CT_goldenreads.fastq > alignment.sam

I've noticed that the resulting SAM file does not comply with the SAM format specification:

"Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ" [1].

Example from actual output:

HWI-ST1309F:275:C8E2LANXX:3:1101:10013:85607 16 chrRCRS:6500-14600 2688 4 74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S * 0 0 TACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTATGCCTCAGGATACTCTTCAATAGCCATCGCT F7<</<B7<<<FF/FBFB/FFFB/FFFFFFFFFBFF7<F/FBFFF<BBFFFFFFFFBFFFFBFBBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFBBBBB AS:i:152 NM:i:124 ZS:i:142

The length of the sequence reported in that entry is 105:

len(TACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTATGCCTCAGGATACTCTTCAATAGCCATCGCT) = 105

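The CIGAR string is 74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S. Its query-consuming operations (everything except the 1D deletion) sum to 74+4+1+4+1+2+1+1+2+2+3+4+1+2+1+2+18=123, which is 18 more than the 105 bases in SEQ and exactly the length of the trailing 18S. It seems that the soft-clipped residues are not being reported in the SEQ field.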
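The arithmetic above can be reproduced with a short script. This is only a minimal sketch in Python, not part of ssw_test: the helper name `cigar_query_length` is made up for illustration, and the only rule it encodes is the one quoted from the SAM specification (M/I/S/=/X lengths must add up to the length of SEQ).

```python
import re

# CIGAR operations that consume bases of SEQ, per the rule quoted from the SAM spec.
QUERY_CONSUMING = set("MIS=X")


def cigar_query_length(cigar: str) -> int:
    """Hypothetical helper: sum the lengths of the M/I/S/=/X operations."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in ops if op in QUERY_CONSUMING)


# CIGAR and SEQ copied from the record shown above.
cigar = "74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S"
seq = (
    "TACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACC"
    "CCTATGCCTCAGGATACTCTTCAATAGCCATCGCT"
)

expected = cigar_query_length(cigar)
print(f"CIGAR implies {expected} bases, SEQ has {len(seq)}")
if expected != len(seq):
    print(f"Mismatch of {expected - len(seq)} bases")
```

For this record the two lengths differ by 18, the size of the trailing soft clip, which is what points at the soft-clipped bases being dropped from SEQ.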

Cheers, Gon

[1] https://samtools.github.io/hts-specs/SAMv1.pdf

mengyao commented 7 years ago

Dear Gon,

I apologize for the late reply.

Thank you for pointing this problem out. I've fixed this error. Please check the latest version.

Yours,

Mengyao
