mengyao / Complete-Striped-Smith-Waterman-Library

298 stars 112 forks source link

wrong score for very large sequences #9

Closed tolot27 closed 2 years ago

tolot27 commented 10 years ago

I've a very large protein sequence (the largest known) with 36851 aa (inkl. hard masked regions)

The self alignment is stopped at 5220 and the optimal alignment score is the maximun of signed short (16 bit):

./ssw_test -o 13 -e 2 -p longest.fa longest.fa
optimal_alignment_score: 32767  suboptimal_alignment_score: 32767   strand: +   target_end: 5220    query_end: 5220

It looks like there are some hard-coded limits. There are much more sequences larger than 5220 aa (i. e. titins). Please can you investigate this?

A simple test can be made with V5FZX2, which is just 5573 aa, but also gets not correctly scored.

mengyao commented 10 years ago

Dear Mathias,

Thank you for your email.

Yes, as you see the limitation of SSW is the largest score it can reach is 2^15 - 1.

SSW splits the 128bit register into 16 pieces (8bit for each piece to hold a number), so it can calculate 16 numbers in parallel. When the reads are long, it changes to split the register into 8 pieces(16bit / piece). Theoretically the largest score one piece can hold should be 2^16 - 1. The reason why SSW used the int16_t but not uint16_t is: very unfortunately, SSE2 doesn’t have the command _mm_max_epu16 (unsigned calculation), instead it only has _mm_max_epi16 (signed calculation).

SSE4 does have _mm_max_epu16, so in the future SSW can be improved to hold larger scores.

Currently, considering the limited usage of SSE4 CPUs, I would like to keep SSW as what it likes now.

However, there is still a way that you can revise SSW easily to make it able to align your long sequences: replace all ‘epi’ comments with ‘epu’.

If your machine can compile the codes without any error, SSW may give the correct result as you want.

Hope this helps.

Thank you for letting me know this problem any way. I will consider alternative solutions for long reads, and decide whether to make the improvement depends on it’s benefit and cost.

Moreover, the blast format is fixed. Thank you for letting me know.

Best wishes,

Mengyao

On Jan 26, 2014, at 4:21 AM, Mathias Walter notifications@github.com wrote:

I've a very large protein sequence (the largest known) with 36851 aa (inkl. hard masked regions)

The self alignment is stopped at 5220 and the optimal alignment score is the maximun of signed short (16 bit):

./ssw_test -o 13 -e 2 -p longest.fa longest.fa optimal_alignment_score: 32767 suboptimal_alignment_score: 32767 strand: + target_end: 5220 query_end: 5220 It looks like there are some hard-coded limits. There are much more sequences larger than 5220 aa (i. e. titins). Please can you investigate this?

A simple test can be made with V5FZX2, which is just 5573 aa, but also gets not correctly scored.

— Reply to this email directly or view it on GitHub.

tolot27 commented 10 years ago

Hi Mengyao,

are you aware of the emulating functions i. e. implemented in sseplus?

I. e. _mm_max_epu16 is emulated here: ssp_max_epu16_SSE2.

I don't know how much these emulations influences the performance but will test it.

tolot27 commented 10 years ago

It's not as simple as it look like. Just by changing _mm_max_epi16 to _mm_max_epu16 or using the emulated function does not return the correct value (for self-alignment of V5FZX2 it is 37049 if BL50 and -o 13 -e 2).

It returns:

optimal_alignment_score: 65535  strand: +       target_end: 1   query_end: 4

Thats the ushort boundery. Something went wrong. Any ideas?

Again, the titins are longer and the maximum self score is 219129. This will never ever fit into a 16 Bit unsigned variable.

mengyao commented 10 years ago

Dear Mathias,

Thank you for introducing me the “SSEPlus”. It’s a very interesting package.

I tried the ssp_max_epu16_SSE2, but I find it turns to add many commands actually in the inner loop of the SW score calculation.

I feel it will slow down the program, so decide don’t use it at this point.

Also, you are right the maximum score of the self-alignment of V5FZX2 cannot fit into a 16 bit variable any way.

SSW is not suitable for aligning long protein sequences at least now.

Maybe SSEARCH is more suitable for this task.

Wish you the best for your work.

Mengyao

On Jan 26, 2014, at 6:52 PM, Mathias Walter notifications@github.com wrote:

It's not as simple as it look like. Just by changing _mm_max_epi16 to _mm_max_epu16 or using the emulated function does not return the correct value (for self-alignment of V5FZX2 it is 37049 if BL50 and -o 13 -e 2).

It returns:

optimal_alignment_score: 65535 strand: + target_end: 1 query_end: 4 Thats the ushort boundery. Something went wrong. Any ideas?

Again, the titins are longer and the maximum self score is 219129. This will never ever fit into a 16 Bit unsigned variable.

— Reply to this email directly or view it on GitHub.

jeffdaily commented 9 years ago

@tolot27 I have a 32bit implementation of the SW algorithm that produces the correct score for V5FZX2 and your scoring criteria (B50, open 13, extend 2). I do not output the traceback, however. Would this still be of value to you?


Jeff Daily Scientist ADVANCED COMPUTING, MATHEMATICS, AND DATA DIVISION High Performance Computing Group

Pacific Northwest National Laboratory 902 Battelle Boulevard P.O. Box 999, MSIN J4-30 Richland, WA 99352 USA Tel: 509-372-6548 Fax: 509-372-4720 jeff.daily@pnnl.gov www.pnnl.gov