steuernb / NLR-Annotator

NLR-Annotator upload
GNU General Public License v3.0

-a returns alignments of different lengths #33

Open bmansfeld opened 1 year ago

bmansfeld commented 1 year ago

Hi again, I love the -a option; it really helps streamline downstream phylogenetic analyses. I just wanted to note something that, at least in my hands, seems to be a bug.

When running the following command

java -Xmx56G -jar NLR-Annotator-v2.1b.jar -i ../genome.fasta -x mot.txt -y store.txt -a genome_NLR.nbarcMotifAlignment.fasta -t 40

and then running FastTree (from bioconda v2.1.11 https://anaconda.org/bioconda/fasttree) I received the following error:

FastTree Version 2.1.11 Double precision (No SSE3)
Alignment: Mfusca_NLR.nbarcMotifAlignment.fasta
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Wrong number of characters for genome_chr11_nlr3: expected 180 but have 161 instead.
This sequence may be truncated, or another sequence may be too long.

On further inspection, several of the alignment rows had different lengths; in most cases the number of - (gap) characters differed.
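For anyone who wants to check their own -a output, tabulating the row lengths shows which records are affected. This is a minimal Python sketch, not part of NLR-Annotator; the function names (`read_fasta`, `row_lengths`, `length_summary`) are hypothetical, and the toy records stand in for the real alignment file.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def row_lengths(lines):
    """Map each record header to the length of its alignment row."""
    return {header: len(seq) for header, seq in read_fasta(lines)}


def length_summary(lines):
    """Count how many rows have each length; one key means a consistent alignment."""
    return Counter(row_lengths(lines).values())


# Toy data (made up): the second row is two columns short.
toy = [">nlr1\n", "AC--\n", ">nlr2\n", "AC\n"]
print(row_lengths(toy))
print(length_summary(toy))
```

To use it on a real file, pass an open file handle instead of the toy list, e.g. `length_summary(open("genome_NLR.nbarcMotifAlignment.fasta"))`.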

I worked around this with an awk one-liner that right-pads every sequence row with - to 200 characters (to be on the safe side):

awk '$1~">"{print $0}$1!~">"{tmp="";for(i=1;i<200-length($0)+1;i++){tmp=tmp"-"};print $0""tmp}' genome_NLR.nbarcMotifAlignment.fasta >genome_NLR.nbarcMotifAlignment.200.fasta

Running FastTree on the resulting genome_NLR.nbarcMotifAlignment.200.fasta file yields no errors.
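The same workaround can be written so that rows are padded to the longest row actually present rather than to a fixed 200. This is a rough Python sketch under the same assumption as the awk one-liner, namely one sequence line per record; `pad_fasta_lines` is a hypothetical helper, not something NLR-Annotator provides. Trailing padding only makes the row lengths agree; whether the columns then line up as intended depends on where the gaps went missing, which has not been established in this thread.

```python
def pad_fasta_lines(lines, gap="-"):
    """Right-pad every sequence row with `gap` to the longest row length.

    Assumes each record has exactly one sequence line (no line wrapping).
    Header lines (starting with '>') are passed through unchanged.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    width = max((len(r) for r in rows if not r.startswith(">")), default=0)
    return [r if r.startswith(">") else r.ljust(width, gap) for r in rows]


# Toy data (made up): the second row is two columns short.
toy = [">nlr1\n", "AC--\n", ">nlr2\n", "AC\n"]
for row in pad_fasta_lines(toy):
    print(row)
```

Writing the returned rows to a new file, one per line, gives an input with uniform row lengths for FastTree.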

Let me know if you think I am doing something wrong here or misinterpreting something. I just wanted to report it. Ben

steuernb commented 1 year ago

Hi Ben, thanks for this report! This might indeed be a bug. I'll see if I can reproduce the error. Any chance you can share your input data? Burkhard

bmansfeld commented 1 year ago

Hey Burkhard, thanks for looking into this. The analysis was on our as yet unpublished (preprint: https://www.biorxiv.org/content/10.1101/2023.03.22.533842v1) Malus fusca genome. You can download the hap1 FASTA from here: https://www.rosaceae.org/Analysis/15540543?pane=bio_data_2_rsc_assembly Let me know if there's anything else you need. Ben