rigdenlab / conkit

Contact Prediction ToolKit
https://www.conkit.org/en/latest/
BSD 3-Clause "New" or "Revised" License

FASTA conversion into A3M should be gapless #98

Open FilomenoSanchez opened 2 years ago

FilomenoSanchez commented 2 years ago

The query sequence in the A3M format should be gapless, as discussed in #96. HH-suite provides a script, reformat.pl, that can handle this; take a look and try to add the same behaviour to conkit.

sadiogo commented 2 years ago

I was thinking about this code and it might be simpler than it looks. Here's how to do it:

  1. Parse the sequences in the alignment, convert each one into a list, and store the lists in a tuple. Create a variable called gap_positions = [].

  2. Find all the gap indexes in the first (query) sequence, store them in gap_positions, and then remove those gaps using the pop() method. Pop from the highest index down so that earlier removals do not shift the remaining indexes.

  3. In every non-query sequence, run a for loop over gap_positions (again from the highest index down) and check whether the character at that position is a gap. If it is, remove it using pop(); otherwise convert the letter to lowercase.

That's it. Of course, there might be more efficient ways to do it, but those are the basics.
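The steps above can be sketched roughly as follows. This is only a minimal illustration, not conkit code: the function name fasta_alignment_to_a3m and the "-" gap character are assumptions, and it builds new strings instead of calling pop(), which sidesteps the index-shifting problem.

```python
def fasta_alignment_to_a3m(sequences, gap="-"):
    """Hypothetical helper (not part of conkit): make the query gapless.

    `sequences` is a list of equal-length aligned strings with the
    query sequence first.
    """
    if not sequences:
        return []
    query = sequences[0]
    if any(len(seq) != len(query) for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    # Columns where the query has a gap
    gap_positions = {i for i, char in enumerate(query) if char == gap}

    # The query itself simply loses its gaps
    converted = [query.replace(gap, "")]

    for seq in sequences[1:]:
        out = []
        for i, char in enumerate(seq):
            if i in gap_positions:
                # Gap opposite a query gap: drop the column entirely.
                # Residue opposite a query gap: an insertion, so lowercase it.
                if char != gap:
                    out.append(char.lower())
            else:
                out.append(char)
        converted.append("".join(out))
    return converted
```

For example, calling it on ["AC-GT", "ACTGT", "A--GT"] gives ["ACGT", "ACtGT", "A-GT"].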

However, many programs don't use the lowercase letters (which indicate insertions relative to the query sequence) and wrongly ask for alignments in the A3M format (which must, by definition, display the insertions). conkit_convert could therefore also provide an output without insertions, in which case you just need to remove those letters instead of converting them to lowercase.
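A rough sketch of that insertion-free variant, under the same assumptions as before (the function name drop_query_gap_columns and the "-" gap character are made up for illustration and are not conkit's API): every column where the query has a gap is dropped from all sequences.

```python
def drop_query_gap_columns(sequences, gap="-"):
    """Hypothetical helper (not part of conkit): remove insertion columns.

    `sequences` is a list of equal-length aligned strings with the
    query sequence first.
    """
    if not sequences:
        return []
    query = sequences[0]
    if any(len(seq) != len(query) for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    # Keep only the columns where the query has a residue
    keep = [i for i, char in enumerate(query) if char != gap]
    return ["".join(seq[i] for i in keep) for seq in sequences]
```

With the input ["AC-GT", "ACTGT", "A--GT"] this returns ["ACGT", "ACGT", "A-GT"], so every output sequence has the same length as the gapless query.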