Closed BenKuhnhaeuser closed 4 years ago
Dear Benedikt,
Thanks for contacting me about trimAl - I hope I can answer your question.
I created the script get_sequences_gaps_ratio.py as a quick mechanism to estimate the proportion of gaps per sequence rather than the usual trimAl's behaviour of doing it per column.
Thus, if you run the script (from the scripts folder in this repository) with an input alignment file - I'm using one of the provided files for this example.
./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta
... you will get the following answer (please, note that the input file is in FASTA format, you can indicate other input formats using the -f parameter).
0 Sp8 0.8333 1 Sp17 0.6667 2 Sp10 0.5000 3 Sp26 0.3333 4 Sp33 0.1667 5 Sp6 0.0000
The columns are the sequence index in the input file, the sequence name and the proportion of gaps per sequence.
If you want to identify sequences equal or above a given threshold, let say 0.5, you can use the script like this ...
./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5
... which will produce the following output
0 Sp8 0.8333 1 Sp17 0.6667 2 Sp10 0.5000
As you can remove sequences from any given alignment indicating their index, you can request to get just the sequences' index equal or higher than a given threshold like this ...
./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index
... which will produce the following output.
0,1,2
Then, you can combine everything in a single command-line using the trimAl's parameter -selectseqs { n,l,m-k } like this ... (probably it is not the prettiest way to do it).
trimal -in ../dataset/example.005.AA.fasta -selectseqs { $(./get_sequences_gaps_ratio.py -i ../dataset/example.005.AA.fasta --threshold 0.5 --show_only_index) }
... which results in this ...
>Sp8 NGLQIHMMGIII------------------------------------------------ --------------------------- >Sp17 NGLQIHMMGIIIIIIIIIIIIIIIIII--------------------------------- --------------------------- >Sp10 NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII------------------ --------------------------- >Sp26 NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII--- --------------------------- >Sp33 NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII IIIIIIIIIIII--------------- >Sp6 NGLQIHMMGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII IIIIIIIIIIIIIIIIIIIIIIIIIII
Please, let me know if this helps (or new ways to handle your alignment). I'm always interested in getting to know how people use trimAl and how we can do it better.
With best regards,
S
Hi Salvador, I have noted that you have written a new script
get_sequences_gaps_ratio.py
that allows to remove sequences from alignments based on the proportion of gaps. I have cloned your latest repository, but trimAl doesn’t seem to recognize the new--threshold
and--show_only_index
flags.Would you be so kind to give me a hint on how to get this script to work?