pantherdb / TreeGrafter

TreeGrafter is a new software tool for annotating uncharacterized protein sequences, using annotated phylogenetic trees.
http://pantherdb.org
GNU General Public License v3.0

Ignore less reliable HMMer domain results (noted with '?') in reconstructing alignment #7

Open dustine32 opened 4 years ago

dustine32 commented 4 years ago

Attempting to graft this sequence:

>Cyanophora_paradoxa_CPAR027107_Apc11
QKTLTILAKDRNYKVEDFKAAGAIAKTRLDQQREPCSCKVAASDAHPCVRRVLFLNLSAA
VGAREPRLGARRAPALRSMKVKIVWHAVASWTWNVDDEACGICRNAYDGCCPDCKTPGDD
CPLWGECRHAFHLHCILKWVNSQQEGKQHCPMCRRDWKFRSSD

...onto the PANTHER 15.0 library, TreeGrafter outputs this error:

ERROR MSF of Cyanophora_paradoxa_CPAR027107_Apc11 should have length 90, actual length is 203

While debugging, I found that the treeGrafter.pl script appears to parse the hmmscan output for the top hit, PTHR11210, incorrectly. As a result, the query sequence alignment that TreeGrafter reconstructs does not match the alignment length of the PTHR11210 family PIR file:

[image]

Specifically, the script recognizes that this hit has two domains and uses that count when iterating through the start/end alignment values. Unfortunately, the regex /!/ used to parse out those start/end values skips the first domain, which hmmscan marks with ? (less reliable), so the wrong values are used. We'll need to debug further to figure out how to line these parts up correctly.
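To illustrate one possible way of keeping the domain count and the start/end values in step, here is a minimal sketch. It is written in Python rather than the script's Perl, and it is not the actual treeGrafter.pl logic. It assumes the per-domain table layout of hmmscan's default text output; the sample rows are invented for illustration and are not the real output for this query. The names `DOMAIN_ROW`, `parse_domains`, and `reliable_domains` are hypothetical. The idea is to parse every domain row together with its `!` or `?` flag, filter afterwards, and derive the count from the filtered list instead of from a separately parsed number.

```python
import re

# One row of the per-domain table in hmmscan's default text output:
#   #  flag score bias c-Evalue i-Evalue hmmfrom hmmto .. alifrom alito .. envfrom envto .. acc
# The flag is '!' for a domain that passes the inclusion threshold and '?' otherwise.
DOMAIN_ROW = re.compile(
    r"^\s*(?P<idx>\d+)\s+(?P<flag>[!?])\s+"
    r"(?P<score>-?\d+(?:\.\d+)?)\s+\S+\s+\S+\s+\S+\s+"
    r"(?P<hmm_from>\d+)\s+(?P<hmm_to>\d+)\s+\S+\s+"
    r"(?P<ali_from>\d+)\s+(?P<ali_to>\d+)\s+\S+\s+"
    r"(?P<env_from>\d+)\s+(?P<env_to>\d+)\s+\S+\s+(?P<acc>\S+)\s*$"
)

# Invented sample rows (NOT the real hmmscan output from this issue).
SAMPLE = [
    "   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc",
    " ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----",
    "   1 ?   -1.3   0.0      0.17      0.17      61      74 ..      20      33 ..      12      35 .. 0.74",
    "   2 !   40.7   0.0   1.3e-14   1.3e-14       2      90 .]     100     163 ..      99     163 .] 0.95",
]


def parse_domains(lines):
    """Parse every domain row, keeping the reliability flag with its coordinates."""
    domains = []
    for line in lines:
        m = DOMAIN_ROW.match(line)
        if m is None:
            continue  # header, separator, or unrelated line
        domains.append({
            "index": int(m["idx"]),
            "reliable": m["flag"] == "!",
            "hmm_from": int(m["hmm_from"]),
            "hmm_to": int(m["hmm_to"]),
            "ali_from": int(m["ali_from"]),
            "ali_to": int(m["ali_to"]),
        })
    return domains


def reliable_domains(domains):
    """Keep only the '!' domains; the caller iterates over this list directly."""
    return [d for d in domains if d["reliable"]]


all_domains = parse_domains(SAMPLE)
kept = reliable_domains(all_domains)

# The loop bound comes from the filtered list, so it cannot disagree with the values.
for d in kept:
    print(d["index"], d["hmm_from"], d["hmm_to"], d["ali_from"], d["ali_to"])
```

Because the flag travels with each row's coordinates, dropping the `?` domains cannot shift which start/end values belong to which domain. Whether ignoring `?` domains entirely is the right behavior for TreeGrafter is a separate question that this sketch does not settle.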