sate-dev / sate-core

3 stars 3 forks source link

Mafft aligned looks at previous alignment length not unaligned sequence #32

Closed smirarab closed 11 years ago

smirarab commented 12 years ago

in tools.py, for MaffAligner we have:

 if len(alignment) <= 200 and alignment.max_sequence_length() < 10000:
            invoc.extend(['--localpair', '--maxiterate', '1000'])

This check is to make sure if input sequences are very long, we use the default mafft instead of localpair version.

There is a subtle issue. alignment passed here is the aligned sequence from previous iterations (possibly a very gappy input alignment from user). Therefore, max_sequence_length() will return the length of aligned sequences from the previous iteration, and not the unaligned sequences that are going to be aligned in this iteration. This is unfortunate because the actual sequence might be say just 3000bp long, but a gappy inital alignment with 10000 sites will results in the use of the suboptimal mafft.

I think this test should be based on unaligned sequence length.

Thanks Siavash

joaks1 commented 11 years ago

Hi Siavash,

Thank you for raising this issue. Commit 4a62bdafee811723c5df50cfbdfdd5b0094edb05 should solve this issue and provide the expected behavior. I modified the max_sequence_length() method to ignore/strip gap characters ('-') when determining the longest sequence in the alignment.

If you could pull this latest change and confirm that it results in the correct behavior, that would be great!

Thanks again,

Jamie

joaks1 commented 11 years ago

Commit 387b6ce2160f84786a53294065a42b0f601115c7 provides a (weak) test to confirm the Alignment.max_sequence_length() is returning the length of the longest unaligned sequence. I am closing this issue, but if there are problems with the new behavior, please let me know.

Thanks,

Jamie