stajichlab / PHYling

PHYling pipeline for species tree building from annotated genomes (see https://github.com/stajichlab/AAFTF and https://github.com/nextgenusfs/funannotate for assembly and annotation steps)
MIT License

uppercase option #14

Open hyphaltip opened 1 year ago

hyphaltip commented 1 year ago

While the lowercase residues in an alignment are useful output from HMMER, they don't mean much for phylogenetic analyses. Tools like VeryFastTree ignore lowercase characters, so I think it would be better to have an option to force uppercase sequences when writing concatenated or single alignments. Alternatively, make uppercase the default (the common use case) but allow the user to turn it off.

Command: VeryFastTree -double-precision concat_alignments.aa.mfa
VeryFastTree Version 4.0.3 (OpenMP, SSE) Double precision with SSE3
Alignment: concat_alignments.aa.mfa
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Ignored unknown character a (seen 15053031 times)
Ignored unknown character c (seen 696310 times)
Ignored unknown character d (seen 8021126 times)
Ignored unknown character e (seen 8620113 times)
Ignored unknown character f (seen 2958267 times)
Ignored unknown character g (seen 9532367 times)
Ignored unknown character h (seen 2355402 times)
Ignored unknown character i (seen 3133116 times)
Ignored unknown character k (seen 5589872 times)
Ignored unknown character l (seen 9940350 times)
Ignored unknown character m (seen 1777397 times)
Ignored unknown character n (seen 2948421 times)
Ignored unknown character p (seen 8696418 times)
Ignored unknown character q (seen 4590872 times)
Ignored unknown character r (seen 7790859 times)
Ignored unknown character s (seen 10591681 times)
Ignored unknown character t (seen 6746813 times)
Ignored unknown character v (seen 6743874 times)
Ignored unknown character w (seen 1089347 times)
Ignored unknown character y (seen 2028882 times)

chtsai0105 commented 1 year ago

This is already handled in the phylotree module: all sequences are forced to uppercase before being sent to VeryFastTree. https://github.com/stajichlab/PHYling/blob/4da4a3a74fba3ec962a1e089e1a021591d3a06be/src/phyling/phylotree.py#L43-L45 But we can definitely do that earlier, when writing the MSA results. I just wasn't sure whether the lowercase was meaningful, so I simply preserved it.
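
For illustration, uppercasing at write time could look roughly like the sketch below. It is not PHYling's actual code: the function name `uppercase_fasta` and the sample records are hypothetical, and it assumes the alignment is plain FASTA text where header lines start with `>`.

```python
from io import StringIO


def uppercase_fasta(lines):
    """Yield FASTA lines with sequence residues uppercased.

    Hypothetical helper, not part of PHYling. Header lines (starting
    with '>') pass through unchanged; gap characters such as '-' and
    '.' are unaffected by str.upper().
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.upper()


# Made-up two-record alignment with mixed-case residues.
example = StringIO(">sp1\nmkt-ACDE\n>sp2\nMKTaacde\n")
result = "".join(uppercase_fasta(example))
print(result)
```

Because only the sequence lines are touched, taxon names in the headers keep their original case. A user-facing switch to disable this would be a separate design decision.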