Removing gappy sequences

Dear Benedikt,

Thanks for contacting me about trimAl - I hope I can answer your question.

I created the script get_sequences_gaps_ratio.py as a quick way to estimate the proportion of gaps per sequence, rather than per column as trimAl usually does.

If you run the script (from the scripts folder in this repository) with an input alignment file (I'm using one of the provided files for this example) ...

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta

... you will get the following output. Please note that the input file is in FASTA format; you can indicate other input formats using the -f parameter.

   0    Sp8                               0.8333
   1    Sp17                              0.6667
   2    Sp10                              0.5000
   3    Sp26                              0.3333
   4    Sp33                              0.1667
   5    Sp6                               0.0000

The columns are the sequence index in the input file, the sequence name and the proportion of gaps per sequence.
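
For readers who want to see what "proportion of gaps per sequence" means in practice, here is a minimal Python sketch of the idea. It is not the code of get_sequences_gaps_ratio.py; the function name gap_ratios and the assumption that "-" is the only gap character are mine, for illustration only.

```python
def gap_ratios(fasta_text, gap_chars="-"):
    """Return a list of (index, name, gap proportion), one per FASTA record.

    Hypothetical helper: assumes gaps are written as '-' unless told otherwise.
    """
    records = []  # each entry is [name, list of sequence lines]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word after '>' as the sequence name
            parts = line[1:].split()
            records.append([parts[0] if parts else "", []])
        elif records:
            # Sequence data may be wrapped over several lines
            records[-1][1].append(line)

    results = []
    for index, (name, chunks) in enumerate(records):
        sequence = "".join(chunks)
        gaps = sum(sequence.count(ch) for ch in gap_chars)
        ratio = gaps / len(sequence) if sequence else 0.0
        results.append((index, name, ratio))
    return results


if __name__ == "__main__":
    demo = ">seqA\nAC--\n--\n>seqB some description\nACGT\n"
    for index, name, ratio in gap_ratios(demo):
        print(f"{index:4d}    {name:<30}    {ratio:.4f}")
```

The index is simply the zero-based position of the sequence in the input file, which is the value trimAl expects when you refer to sequences by number.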

If you want to identify sequences with a gap proportion equal to or above a given threshold, let's say 0.5, you can use the script like this ...

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5

... which will produce the following output.

   0    Sp8                               0.8333
   1    Sp17                              0.6667
   2    Sp10                              0.5000

Since you can remove sequences from any given alignment by indicating their indexes, you can ask the script to print only the indexes of the sequences whose gap proportion is equal to or above a given threshold, like this ...

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index

... which will produce the following output.

0,1,2
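
The thresholding step is just a filter on the third column followed by a join. A small Python sketch, using the values from the output shown earlier (the function name indices_at_or_above is hypothetical and is not part of the script):

```python
def indices_at_or_above(ratios, threshold):
    """Return the indexes whose gap proportion is >= threshold, comma-separated."""
    selected = [str(index) for index, _name, ratio in ratios if ratio >= threshold]
    return ",".join(selected)


# (index, name, gap proportion) rows copied from the example output
ratios = [
    (0, "Sp8", 0.8333),
    (1, "Sp17", 0.6667),
    (2, "Sp10", 0.5000),
    (3, "Sp26", 0.3333),
    (4, "Sp33", 0.1667),
    (5, "Sp6", 0.0000),
]

print(indices_at_or_above(ratios, 0.5))  # prints 0,1,2
```

Note that the comparison is inclusive, so Sp10 with exactly 0.5000 is selected.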

Then, you can combine everything in a single command line using trimAl's -selectseqs { n,l,m-k } parameter, like this ... (it is probably not the prettiest way to do it).

trimal -in ../dataset/example.005.AA.fasta -selectseqs { $(./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index) }

... which results in this ...

>Sp8
NGLQIHMMGIII------------------------------------------------
---------------------------
>Sp17
NGLQIHMMGIIIIIIIIIIIIIIIIII---------------------------------
---------------------------
>Sp10
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII------------------
---------------------------
>Sp26
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII---
---------------------------
>Sp33
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIII---------------
>Sp6
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIIIIIIIIIIIIIIIIII
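
If the shell one-liner feels fragile, the same trimAl call can be assembled from Python. This is only a sketch: the helper names build_trimal_command and remove_sequences are hypothetical, it assumes the trimal executable is on your PATH, and it assumes that "{", the index list and "}" reach trimAl as three separate arguments, as they do in the shell command above. It also skips the call when no sequence reaches the threshold, since the braces would otherwise be empty.

```python
import subprocess


def build_trimal_command(alignment_path, indices, trimal="trimal"):
    """Build the argument list for trimAl, or return None if nothing is selected."""
    if not indices:
        # No sequence reached the threshold: avoid passing empty braces
        return None
    index_list = ",".join(str(i) for i in indices)
    # '{', the list and '}' are separate arguments, mirroring the shell command
    return [trimal, "-in", alignment_path, "-selectseqs", "{", index_list, "}"]


def remove_sequences(alignment_path, indices):
    """Run trimAl and return what it writes to standard output (None if skipped)."""
    command = build_trimal_command(alignment_path, indices)
    if command is None:
        return None
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    print(build_trimal_command("../dataset/example.005.AA.fasta", [0, 1, 2]))
```

Passing the arguments as a list avoids any shell quoting issues around the braces.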

Please let me know if this helps, or if you find other ways to handle your alignment. I'm always interested in learning how people use trimAl and how we can improve it.

With best regards,
