inab / trimal

A tool for automated alignment trimming in large-scale phylogenetic analyses. Development version: 2.0
https://trimal.readthedocs.io/
GNU General Public License v3.0
175 stars 41 forks source link

Removing gappy sequences #58

Closed BenKuhnhaeuser closed 4 years ago

BenKuhnhaeuser commented 4 years ago

Hi Salvador, I have noted that you have written a new script get_sequences_gaps_ratio.py that allows to remove sequences from alignments based on the proportion of gaps. I have cloned your latest repository, but trimAl doesn’t seem to recognize the new --threshold and --show_only_index flags.

Would you be so kind to give me a hint on how to get this script to work?

scapella commented 4 years ago

Dear Benedikt,

Thanks for contacting me about trimAl - I hope I can answer your question.

I created the script get_sequences_gaps_ratio.py as a quick mechanism to estimate the proportion of gaps per sequence rather than the usual trimAl's behaviour of doing it per column.

Thus, if you run the script (from the scripts folder in this repository) with an input alignment file - I'm using one of the provided files for this example.

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta

... you will get the following answer (please, note that the input file is in FASTA format, you can indicate other input formats using the -f parameter).

   0    Sp8                               0.8333
   1    Sp17                              0.6667
   2    Sp10                              0.5000
   3    Sp26                              0.3333
   4    Sp33                              0.1667
   5    Sp6                               0.0000

The columns are the sequence index in the input file, the sequence name and the proportion of gaps per sequence.

If you want to identify sequences equal or above a given threshold, let say 0.5, you can use the script like this ...

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5

... which will produce the following output

   0    Sp8                               0.8333
   1    Sp17                              0.6667
   2    Sp10                              0.5000

As you can remove sequences from any given alignment indicating their index, you can request to get just the sequences' index equal or higher than a given threshold like this ...

./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index

... which will produce the following output.

0,1,2

Then, you can combine everything in a single command-line using the trimAl's parameter -selectseqs { n,l,m-k } like this ... (probably it is not the prettiest way to do it).

trimal -in ../dataset/example.005.AA.fasta -selectseqs { $(./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index) }

... which results in this ...

>Sp8
NGLQIHMMGIII------------------------------------------------
---------------------------
>Sp17
NGLQIHMMGIIIIIIIIIIIIIIIIII---------------------------------
---------------------------
>Sp10
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII------------------
---------------------------
>Sp26
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII---
---------------------------
>Sp33
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIII---------------
>Sp6
NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
IIIIIIIIIIIIIIIIIIIIIIIIIII

Please, let me know if this helps (or new ways to handle your alignment). I'm always interested in getting to know how people use trimAl and how we can do it better.

With best regards,

S