mengyao / Complete-Striped-Smith-Waterman-Library

298 stars 113 forks source link

Case with different alignment for plus/plus vs. minus/minus #59

Closed mathog closed 6 years ago

mathog commented 6 years ago

Greetings.

Here is an example of a pair of sequences which give a slightly different alignment depending on the sequence directions. In the CIGAR sequence it results in M23 on one end for one direction, and M24 on that end in the other direction. The rest of the CIGAR sequence is the same (albeit in the other order.) One of these alignments is offset by 1bp which isn't obvious from the CIGAR sequence but is evident if it is converted to any other format. Note the alignment positions on the contig sequence, 0-542 in one case, 1-543 in the other. That is the same length but the CIGAR string is 1bp longer, resulting in an offset. The M23 vs. M24 change isn't that big a deal, but the associated incorrect 1bp offset is. The weights shown here are not critical - the default weights have the same issue. "-G" is a local modification to the test program which only prints the CIGAR string.

ssw_test -G /tmp/contig_476464_rev.fasta /tmp/pb_31509.fasta
M19D1M14D1M17I1M27I1M1I2M23D2M29D4M2D3M4D1M1D3M16D1M2D1M19I1M20D1M6I1M3D1M34D1M12D1M6D1M7I1M9D2M20D1M5D1M19D1M41I1M3D1M30D2M7D2M19D1M14D1M3I2M5I2M15I1M26I2M7I3M24

ssw_test -G /tmp/contig_476464.fasta /tmp/pb_31509_rev.fasta
M23I3M7I2M26I1M15I2M5I2M3D1M14D1M17D2M10D2M27D1M5I1M38D1M20D1M7D1M20D2M9I1M7D1M7D1M11D1M33D1M5I1M6D1M20I1M19D1M2D1M13D2M2D4M3D4M3D1M31D2M19I4M4D1M26I1M18D1M14D1M19

Print result or result_rc (in main.c) running in gdb for various commands:

ssw_test -R -G -m 2 -x 5 -o 2 -e 1 /pb_31509.fasta /contig_476464.fasta
$2 = {score1 = 871, score2 = 600, ref_begin1 = 7016, ref_end1 = 7542, read_begin1 = 0, read_end1 = 542, ref_end2 = 7814, cigar = 0x622150, cigarLen = 99}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /pb_31509.fasta /contig_476464_rev.fasta
$1 = {score1 = 871, score2 = 600, ref_begin1 = 7016, ref_end1 = 7542, read_begin1 = 0, read_end1 = 542, ref_end2 = 7814, cigar = 0x61f6d0, cigarLen = 99}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /pb_31509_rev.fasta /contig_476464.fasta
$1 = {score1 = 871, score2 = 604, ref_begin1 = 4473, ref_end1 = 4999, read_begin1 = 1, read_end1 = 543, ref_end2 = 5271, cigar = 0x61f6d0, cigarLen = 98}

ssw_test -R -G -m 2 -x 5 -o 2 -e 1 /contig_476464.fasta /pb_31509.fasta
$1 = {score1 = 871, score2 = 0, ref_begin1 = 1, ref_end1 = 543, read_begin1 = 4473, read_end1 = 4999, ref_end2 = 0, cigar = 0x6814d0, cigarLen = 100}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /contig_476464_rev.fasta /pb_31509.fasta
$1 = {score1 = 871, score2 = 0, ref_begin1 = 0, ref_end1 = 542, read_begin1 = 7016, read_end1 = 7542, ref_end2 = 0, cigar = 0x64d500, cigarLen = 98}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /contig_476464.fasta /pb_31509_rev.fasta
$1 = {score1 = 871, score2 = 0, ref_begin1 = 1, ref_end1 = 543, read_begin1 = 4473, read_end1 = 4999, ref_end2 = 0, cigar = 0x64d2f0, cigarLen = 100}

This is the difference in the alignments, which consists of a single base end gap which is included in one direction but not the other. When it is included the end points are offset by that 1 bp.

    m130116_09555   4451 GAATTCAGAACCACTAGACAACGGACGAGCCCAAACAGTTTGAACACCCC   4500
                                               .|||||||||||||||||||||||   |
    contig_476464      1 ----------------------CGACGAGCCCAAACAGTTTGAACA---C     25
                                                -------23bp------------    vs.
                                               --------24bp------------

Example files are attached.

examples.tar.gz

mathog commented 6 years ago

Note in the ssw_test examples above that the 3rd and 4th effectively just have ref and read swapped. All the coordinates move as expected, but the cigar length of the two is not the same. In fact there are 3 different cigar lengths observed in this set of examples. Why not just 1???

mengyao commented 6 years ago

Hi,

Thank you very much for sending me this example.

I cannot open the examples.tar.gz. Would you mind to have a check of the attachment and send it to me again? Or you can send me the two sequences that have the problem.

I guess, if you switch the positions of the two .fasta files in your ssw_test command, it will also show the different results. I mean you don't need to reverse the sequences to generate this error. Please have a try.

Look forward to your reply.

Yours,

Mengyao

On Fri, Apr 27, 2018 at 1:27 PM mathog notifications@github.com wrote:

Greetings.

Here is an example of a pair of sequences which give a slightly different alignment depending on the sequence directions. In the CIGAR sequence it results in M23 on one end for one direction, and M24 on that end in the other direction. The rest of the CIGAR sequence is the same (albeit in the other order.) One of these alignments is offset by 1bp which isn't obvious from the CIGAR sequence but is evident if it is converted to any other format. Note the alignment positions on the contig sequence, 0-542 in one case, 1-543 in the other. That is the same length but the CIGAR string is 1bp longer, resulting in an offset. The M23 vs. M24 change isn't that big a deal, but the associated incorrect 1bp offset is. The weights shown here are not critical - the default weights have the same issue. "-G" is a local modification to the test program which only prints the CIGAR string.

ssw_test -G /tmp/contig_476464_rev.fasta /tmp/pb_31509.fasta M19D1M14D1M17I1M27I1M1I2M23D2M29D4M2D3M4D1M1D3M16D1M2D1M19I1M20D1M6I1M3D1M34D1M12D1M6D1M7I1M9D2M20D1M5D1M19D1M41I1M3D1M30D2M7D2M19D1M14D1M3I2M5I2M15I1M26I2M7I3M24

ssw_test -G /tmp/contig_476464.fasta /tmp/pb_31509_rev.fasta M23I3M7I2M26I1M15I2M5I2M3D1M14D1M17D2M10D2M27D1M5I1M38D1M20D1M7D1M20D2M9I1M7D1M7D1M11D1M33D1M5I1M6D1M20I1M19D1M2D1M13D2M2D4M3D4M3D1M31D2M19I4M4D1M26I1M18D1M14D1M19

Print result or result_rc (in main.c) running in gdb for various commands:

ssw_test -R -G -m 2 -x 5 -o 2 -e 1 /pb_31509.fasta /contig_476464.fasta $2 = {score1 = 871, score2 = 600, ref_begin1 = 7016, ref_end1 = 7542, read_begin1 = 0, read_end1 = 542, ref_end2 = 7814, cigar = 0x622150, cigarLen = 99}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /pb_31509.fasta /contig_476464_rev.fasta $1 = {score1 = 871, score2 = 600, ref_begin1 = 7016, ref_end1 = 7542, read_begin1 = 0, read_end1 = 542, ref_end2 = 7814, cigar = 0x61f6d0, cigarLen = 99}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /pb_31509_rev.fasta /contig_476464.fasta $1 = {score1 = 871, score2 = 604, ref_begin1 = 4473, ref_end1 = 4999, read_begin1 = 1, read_end1 = 543, ref_end2 = 5271, cigar = 0x61f6d0, cigarLen = 98}

ssw_test -R -G -m 2 -x 5 -o 2 -e 1 /contig_476464.fasta /pb_31509.fasta $1 = {score1 = 871, score2 = 0, ref_begin1 = 1, ref_end1 = 543, read_begin1 = 4473, read_end1 = 4999, ref_end2 = 0, cigar = 0x6814d0, cigarLen = 100}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /contig_476464_rev.fasta /pb_31509.fasta $1 = {score1 = 871, score2 = 0, ref_begin1 = 0, ref_end1 = 542, read_begin1 = 7016, read_end1 = 7542, ref_end2 = 0, cigar = 0x64d500, cigarLen = 98}

ssw_test -G -m 2 -x 5 -o 2 -e 1 /contig_476464.fasta /pb_31509_rev.fasta $1 = {score1 = 871, score2 = 0, ref_begin1 = 1, ref_end1 = 543, read_begin1 = 4473, read_end1 = 4999, ref_end2 = 0, cigar = 0x64d2f0, cigarLen = 100}

This is the difference in the alignments, which consists of a single base end gap which is included in one direction but not the other. When it is included the end points are off by that 1 bp.

m130116_09555   4451 GAATTCAGAACCACTAGACAACGGACGAGCCCAAACAGTTTGAACACCCC   4500
                                           .|||||||||||||||||||||||   |
contig_476464      1 ----------------------CGACGAGCCCAAACAGTTTGAACA---C     25
                                            -------23bp------------    vs.
                                           --------24bp------------

Example files are attached.

examples.tar.gz https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/files/1955991/examples.tar.gz

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/59, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlVdAdD0ZjGcALJA8Y0h2FydxxSZqA_ks5ts1TvgaJpZM4TqvhZ .

mathog commented 6 years ago

Let's see if this one works. pieces.zip

mathog commented 6 years ago

Here is a slightly newer main.c which fixes a couple of edge cases. newer_main.tar.gz

mengyao commented 6 years ago

Thank you. I think they work. I also found the sequences that show wrong result in an older message from you.

Yours,

Mengyao

On Thu, Jul 19, 2018 at 5:19 PM mathog notifications@github.com wrote:

Here is a slightly newer main.c which fixes a couple of edge cases. newer_main.tar.gz https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/files/2211690/newer_main.tar.gz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/59#issuecomment-406417760, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlVdIiWz4arza6B8NnDbZtc3oYbYAvwks5uIPfVgaJpZM4TqvhZ .

mengyao commented 6 years ago

Hi,

I had a look at the alignment results of the sequences you mentioned, and find they are not errors.

For SW, different alignments can give the same score, and these alignments are equivalent. That means all these different cigars are all correct, and they are all the best alignments.

For example, the following two alignments are equivalent:

AGCGCA--CAG----AACACATATA

AGCCCAAACAGTTTGAACACAT-TA

AGCGC--ACA----GAACACATATA

AGCCCAAACAGTTTGAACACAT-TA

They give the same alignment score, and are both best alignments. However, their cigars are different. The different cigars can be caused by different implementations of the SW algorithm (use > or >= in some equations to change the preference of the paths) or how do you feed the sequences to the program (such as what you did).

Yours,

Mengyao

On Fri, Jul 20, 2018 at 9:55 AM 孟瑶赵 zhaomengyao@gmail.com wrote:

Thank you. I think they work. I also found the sequences that show wrong result in an older message from you.

Yours,

Mengyao

On Thu, Jul 19, 2018 at 5:19 PM mathog notifications@github.com wrote:

Here is a slightly newer main.c which fixes a couple of edge cases. newer_main.tar.gz https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/files/2211690/newer_main.tar.gz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/59#issuecomment-406417760, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlVdIiWz4arza6B8NnDbZtc3oYbYAvwks5uIPfVgaJpZM4TqvhZ .

mathog commented 6 years ago

I understand what you are saying, highroad, lowroad and everything in between. However the CIGAR paths shown in the first post differ only in one "clause" (or whatever the right terminology is) on one end, there are no compensating changes elsewhere in the alignment to make things add up.

What are the alignments you get for the forward and reverse cases? The alignment I think they are are shown in the first post, and they differ in the classification of just the one base, which is included (M24) or excluded (M23). The M24 alignment, as far as I can tell, starts with a single bp gap on the end, and is everywhere else the same as the M23 alignment. The M24 should have a lower SW score than the M23 because of that one gap open and one gap extend.