mengyao / Complete-Striped-Smith-Waterman-Library

298 stars 112 forks source link

different traceback than emboss water #24

Closed jeffdaily closed 2 years ago

jeffdaily commented 9 years ago

Using the third sequence from the 100k_illumina1.fastq.gz and the Virus_genome.fa.gz, match of 5, mismatch of 4, open of 10, extend of 1. This is the SSW traceback (modified to report 50 characters per line):

Target:      775    TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT    820
                    ||*|||*|  |*|||    |*|| *|||**|| ||||||*|*||*|||||
Query:         7    TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT    52

Target:      821    GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G    863
                    |*||*|*|||||| ||**||| *||*|    |||*|  |*|||*||   |
Query:        53    GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG    98

Target:      864    G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA    904
                    |    || *|||||*||      ||*||  *||||**|   |||||
Query:        99    GCGTTTATTGTATCGAG------AGGCA--CTTAAATA---CAATA    133

Here is the emboss water traceback, same scoring criteria.

EMBOSS_001       775 TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT    820
                     ||.|||.|  |.|||    |.|| .|||..|| ||||||.|.||.|||||
EMBOSS_001         7 TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT     52

EMBOSS_001       821 GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G    863
                     |.||.|.|||||| ||..||| .||.|    |||.|  |.|||.||   |
EMBOSS_001        53 GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG     98

EMBOSS_001       864 G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA    904
                     |    || .|||||.||      ||.||  .||||  |.| |||||
EMBOSS_001        99 GCGTTTATTGTATCGAG------AGGCA--CTTAA--ATA-CAATA    133

Here is a visual diff thanks to xxdiff. screen shot 2015-07-15 at 9 06 43 am

Both tracebacks produce a score of 160, but the percent identity is slightly different. Which one is "correct"?

mengyao commented 9 years ago

Dear Jeff,

Thans a lot for sending me this example.

Recently, I find an error of ssw. This error sometimes happen during long read alignment. When you flip target and query sequences, this error disappear.

Based on this condition, I suggest you believe EMBOSS water’s result.

To make sure of the alignment result, you can switch target and query sequences for ssw and compare the result to EMBOSS water’s result.

I’m working on fixing this error. I’ll let you know, when it is fixed.

Yours,

Mengyao

On Jul 15, 2015, at 12:10 PM, Jeff Daily notifications@github.com wrote:

Using the third sequence from the 100k_illumina1.fastq.gz x-msg://22/blob/master/demo/100k_illumina1.fastq.gz and the Virus_genome.fa.gz x-msg://22/blob/master/demo/Virus_genome.fa.gz, match of 5, mismatch of 4, open of 10, extend of 1. This is the SSW traceback (modified to report 50 characters per line):

Target: 775 TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT 820 |||||| |||| ||| _|||_|| |||||||||||||| Query: 7 TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT 52

Target: 821 GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G 863 ||||_|||||| ||_||| *||| |||| ||||*|| | Query: 53 GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG 98

Target: 864 G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA 904 | || ||||||| |||| *||||*| ||||| Query: 99 GCGTTTATTGTATCGAG------AGGCA--CTTAAATA---CAATA 133 Here is the emboss water traceback, same scoring criteria.

EMBOSS_001 775 TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT 820 ||.|||.| |.||| |.|| .|||..|| ||||||.|.||.||||| EMBOSS_001 7 TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT 52

EMBOSS_001 821 GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G 863 |.||.|.|||||| ||..||| .||.| |||.| |.|||.|| | EMBOSS_001 53 GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG 98

EMBOSS_001 864 G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA 904 | || .|||||.|| ||.|| .|||| |.| ||||| EMBOSS_001 99 GCGTTTATTGTATCGAG------AGGCA--CTTAA--ATA-CAATA 133 Here is a visual diff thanks to xxdiff. https://cloud.githubusercontent.com/assets/904248/8703184/ddffb9a4-2ad0-11e5-855d-126def22d86e.png Both tracebacks produce a score of 160, but the percent identity is slightly different. Which one is "correct"?

— Reply to this email directly or view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/24.

jeffdaily commented 8 years ago

I recently implemented a non-vectorized SW and got the same traceback as SSW, not emboss water. I wonder now if emboss is incorrect? Near the point that the tracebacks differ, SSW is

TTTAAGGAGACCAATA
*||||**|   |||||
CTTAAATA---CAATA

and emboss is

TTTAAGGAGACCAATA
.||||  |.| |||||
CTTAA--ATA-CAATA

They both score the same (26 by my calculation). But which one is more biologically relevant? I would think the single, longer gap produced by SSW makes more sense than the second lone indel in emboss. Your thoughts?

mengyao commented 8 years ago

Dear Jeff,

For this example, I agree that SSW’s result is better biologically.

Since the score given by both implementations are the same, mathematically these results are equivalent.

From this result, we cannot say emboss is incorrect. We can only say the SW algorithm is not a perfect model for the genome alignment. (Alternative alignment paths are allowed, and SW doesn’t guarantee the best path.)

I think SSW does have the error that you mentioned. I didn’t get time to work on that, but I will fix it sooner or later.

Many thanks,

Mengyao

On Sep 22, 2015, at 4:00 PM, Jeff Daily notifications@github.com wrote:

I recently implemented a non-vectorized SW and got the same traceback as SSW, not emboss water. I wonder now if emboss is incorrect? Unless I did the math wrong, the emboss traceback results in a lower overall score. Near the point that the tracebacks differ, SSW is

TTTAAGGAGACCAATA ||||*| ||||| CTTAAATA---CAATA and emboss is

TTTAAGGAGACCAATA .|||| |.| ||||| CTTAA--ATA-CAATA They both score the same (26 by my calculation). But which one is more biologically relevant? I would think the single, longer gap produced by SSW makes more sense than the second lone indel in emboss. Your thoughts?

— Reply to this email directly or view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/24#issuecomment-142402284.

mengyao commented 8 years ago

Dear Jeff,

Please try this one: http://lh3lh3.users.sourceforge.net/bioseq.shtml

This is an ordinary SW, and I believe it gives correct results. You can use this to test others.

Till now, I only find one example input (a pair of ~3K bp sequences with penalties: match 1, mismatch -1, gap open -1, gap extension -1) that SSW gives wrong result. If you switch target and query, the result turns right. Please let me know, if you find any incorrect case of SSW. This will help me to fix it.

Yours,

Mengyao