Closed jeffdaily closed 2 years ago
Dear Jeff,
Thans a lot for sending me this example.
Recently, I find an error of ssw. This error sometimes happen during long read alignment. When you flip target and query sequences, this error disappear.
Based on this condition, I suggest you believe EMBOSS water’s result.
To make sure of the alignment result, you can switch target and query sequences for ssw and compare the result to EMBOSS water’s result.
I’m working on fixing this error. I’ll let you know, when it is fixed.
Yours,
Mengyao
On Jul 15, 2015, at 12:10 PM, Jeff Daily notifications@github.com wrote:
Using the third sequence from the 100k_illumina1.fastq.gz x-msg://22/blob/master/demo/100k_illumina1.fastq.gz and the Virus_genome.fa.gz x-msg://22/blob/master/demo/Virus_genome.fa.gz, match of 5, mismatch of 4, open of 10, extend of 1. This is the SSW traceback (modified to report 50 characters per line):
Target: 775 TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT 820 |||||| |||| ||| _|||_|| |||||||||||||| Query: 7 TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT 52
Target: 821 GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G 863 ||||_|||||| ||_||| *||| |||| ||||*|| | Query: 53 GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG 98
Target: 864 G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA 904 | || ||||||| |||| *||||*| ||||| Query: 99 GCGTTTATTGTATCGAG------AGGCA--CTTAAATA---CAATA 133 Here is the emboss water traceback, same scoring criteria.
EMBOSS_001 775 TGACCATT--CACCACATTGGTG-TGCACCTC-CAAGCTGGGTACCAGCT 820 ||.|||.| |.||| |.|| .|||..|| ||||||.|.||.||||| EMBOSS_001 7 TGCCCAGTGCCTCCA----GCTGAAGCAGATCTCAAGCTTGCTAGCAGCT 52
EMBOSS_001 821 GCTAGCAAGCTTGAGATCTGCTTCAGC----TGGAGGCACTGGGCA---G 863 |.||.|.|||||| ||..||| .||.| |||.| |.|||.|| | EMBOSS_001 53 GGTACCCAGCTTG-GAGGTGC-ACACCAATGTGGTG--AATGGTCATATG 98
EMBOSS_001 864 G----TA-AGTATCAAGGTTACAAGACAGGTTTAAGGAGACCAATA 904 | || .|||||.|| ||.|| .|||| |.| ||||| EMBOSS_001 99 GCGTTTATTGTATCGAG------AGGCA--CTTAA--ATA-CAATA 133 Here is a visual diff thanks to xxdiff. https://cloud.githubusercontent.com/assets/904248/8703184/ddffb9a4-2ad0-11e5-855d-126def22d86e.png Both tracebacks produce a score of 160, but the percent identity is slightly different. Which one is "correct"?
— Reply to this email directly or view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/24.
I recently implemented a non-vectorized SW and got the same traceback as SSW, not emboss water. I wonder now if emboss is incorrect? Near the point that the tracebacks differ, SSW is
TTTAAGGAGACCAATA
*||||**| |||||
CTTAAATA---CAATA
and emboss is
TTTAAGGAGACCAATA
.|||| |.| |||||
CTTAA--ATA-CAATA
They both score the same (26 by my calculation). But which one is more biologically relevant? I would think the single, longer gap produced by SSW makes more sense than the second lone indel in emboss. Your thoughts?
Dear Jeff,
For this example, I agree that SSW’s result is better biologically.
Since the score given by both implementations are the same, mathematically these results are equivalent.
From this result, we cannot say emboss is incorrect. We can only say the SW algorithm is not a perfect model for the genome alignment. (Alternative alignment paths are allowed, and SW doesn’t guarantee the best path.)
I think SSW does have the error that you mentioned. I didn’t get time to work on that, but I will fix it sooner or later.
Many thanks,
Mengyao
On Sep 22, 2015, at 4:00 PM, Jeff Daily notifications@github.com wrote:
I recently implemented a non-vectorized SW and got the same traceback as SSW, not emboss water. I wonder now if emboss is incorrect? Unless I did the math wrong, the emboss traceback results in a lower overall score. Near the point that the tracebacks differ, SSW is
TTTAAGGAGACCAATA ||||*| ||||| CTTAAATA---CAATA and emboss is
TTTAAGGAGACCAATA .|||| |.| ||||| CTTAA--ATA-CAATA They both score the same (26 by my calculation). But which one is more biologically relevant? I would think the single, longer gap produced by SSW makes more sense than the second lone indel in emboss. Your thoughts?
— Reply to this email directly or view it on GitHub https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/issues/24#issuecomment-142402284.
Dear Jeff,
Please try this one: http://lh3lh3.users.sourceforge.net/bioseq.shtml
This is an ordinary SW, and I believe it gives correct results. You can use this to test others.
Till now, I only find one example input (a pair of ~3K bp sequences with penalties: match 1, mismatch -1, gap open -1, gap extension -1) that SSW gives wrong result. If you switch target and query, the result turns right. Please let me know, if you find any incorrect case of SSW. This will help me to fix it.
Yours,
Mengyao
Using the third sequence from the 100k_illumina1.fastq.gz and the Virus_genome.fa.gz, match of 5, mismatch of 4, open of 10, extend of 1. This is the SSW traceback (modified to report 50 characters per line):
Here is the emboss water traceback, same scoring criteria.
Here is a visual diff thanks to xxdiff.
Both tracebacks produce a score of 160, but the percent identity is slightly different. Which one is "correct"?