nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

update crashes due to Glycine-Threonine repeat protein #359

Closed eyalbenda closed 4 years ago

eyalbenda commented 4 years ago

Are you using the latest release? yes

Describe the bug I have what has to be a very exotic error, caused by a protein that has a long stretch of GT (Gly-Thr) repeats, and little else. This is the predicted protein:

FUN_000647-T1 FUN_000647 MLYHLALWLLDLSSSGTGTGTGTGTGTRTGGAGTGTGTGTGTGGTGGTGGTGGAGTGTGTVGTGGTGTGTGTGGAGTGTG TGGAGTGGAGTGTGTGTGGTGGTGGTGTGTVGTGGAGTGTGTGGAGTGGAGTGTGTGTGGTAGTGTGTGGTGTGTGTGTG GTGTGTGTGRTGGTGGTVGTGTGGTAGTGGTGGTGTVGTGGAGTGTGTGTGTGGTGGTVGTGTGGTGTGTGTNGTGGTGG TGGTGGTGTGTGSFP

The pasa step of funannotate update fails, and the pasa log points to a failed call to Pearson's fasta program (see below). For reference, NCBI's blast refuses to search this protein, since it recognizes it as a DNA sequence instead of a protein. I believe the best solution, in this case, would be to somehow remove the protein from predict_results and reinsert it into update_results manually. Is this possible?

Logfiles Log of the failed call to Pearson's fasta program

FASTA searches a protein or DNA sequence data bank version 36.3.8g Dec, 2017 Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Warning - unrecognized residue at 1:L - 76
Warning - unrecognized residue at 4:L - 76
Warning - unrecognized residue at 6:L - 76
Warning - unrecognized residue at 8:L - 76
Warning - unrecognized residue at 9:L - 76
Warning - unrecognized residue at 11:L - 76
Warning - unrecognized residue at 253:F - 70
ERROR aa0[255] = [25 > 17] out of range
ERROR *** validate_params() failed: -- /home/multivac/NanoSystem/anaconda37/envs/funannotate/bin/fasta /tmp/8553-f0b315f1-4637038-404652321.714655.seq1 /tmp/8553-f0b315f1-4637038-404652321.714655.seq2

seq1:

seq1 MLYHLALWLLDLSSSGTGTGTGTGTGTRTGGAGTGTGTGTGTGGTGGTGGTGGAGTGTGT VGTGGTGTGTGTGGAGTGTGTGGAGTGGAGTGTGTGTGGTGGTGGTGTGTVGTGGAGTGT GTGGAGTGGAGTGTGTGTGGTAGTGTGTGGTGTGTGTGTGGTGTGTGTGRTGGTGGTVGT GTGGTAGTGGTGGTGTVGTGGAGTGTGTGTGTGGTGGTVGTGTGGTGTGTGTNGTGGTGG TGGTGGTGTGTGSFP*%

seq2:

seq2 MTSSNAPGTGTGTGTGTGTRTGGAGTGTGTGTGTGGTGGTGGTGGAGTGTGTVGTGGTGT GTGTGGAGTGTGTGGAGTGGAGTGTGTGTGGTGGTGGTGTGTVGTGGAGTGTGTGGAGTG GAGTGTGTGTGGTAGTGTGTGGTGTGTGTGTGGTGTGTGTGRTGGTGGTVGTGTGGTAGT GGTGGTGTVGTGGAGTGTGTGTGTGGTGGTVGTGTGGTGTGTGTNGTGGTGGTGGTGGTG TGTGSFP*%

OS/Install Information Funannotate 1.7.1, installed with conda

nextgenusfs commented 4 years ago

Strange indeed. To fix the underlying issue, it would be better to open an issue in PASA, since it may be possible to add the -p option to the relevant fasta36 calls in PASA. Then presumably it won't try to predict the alphabet (DNA vs protein).
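To illustrate why an alphabet guess can go wrong here, the sketch below measures how much of a sequence is made of letters that are also nucleotide codes (A, C, G, T, N). This is not the heuristic that fasta36 or BLAST actually uses (neither is shown in this thread); `nucleotide_like_percent` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper (not part of funannotate, PASA, or fasta36):
# print the integer percentage of characters in a sequence that are
# also valid nucleotide codes (A, C, G, T, N).
nucleotide_like_percent() {
    seq=$1
    total=${#seq}
    # Avoid dividing by zero on an empty sequence.
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    # Delete everything except A/C/G/T/N, then count what is left.
    nuc=$(printf '%s' "$seq" | tr -cd 'ACGTNacgtn' | wc -c | tr -d ' ')
    echo $(( 100 * nuc / total ))
}
```

Calling it on the FUN_000647 protein (with the spaces removed) should give a very high percentage, because the sequence is mostly G and T with some A. Any tool that guesses the alphabet from composition could plausibly mistake it for DNA.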

For an immediate workaround you can use funannotate fix to drop the model causing issues, then re-run funannotate update. After that you'll have to manually add that gene model back via the NCBI tbl format and run funannotate fix again with the updated tbl file.

echo "FUN_000647" > model_drop.txt
funannotate fix -d model_drop.txt -i outfolder/predict_results/genome.gbk \
     -t outfolder/predict_results/genome.tbl

And then re-run update:

funannotate update -i outfolder

Then add that model back in from the original tbl file. funannotate fix will generate an "archive" folder housing the original results, so you can just copy/paste that gene model back into the new output from update, and then run the fix script on the update_results files.
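For the copy/paste step, a rough sketch for pulling one gene model out of the archived tbl might look like the following. It assumes the usual NCBI five-column feature table layout: a `>Feature` header per contig, and a `gene` feature that opens each model and carries the `locus_tag` qualifier. `extract_gene_block` is a hypothetical helper, not a funannotate command, and the file path in the usage note is a placeholder.

```shell
# Hypothetical helper: print the ">Feature" header plus every line of the
# gene model whose locus_tag matches $1, read from the tbl file in $2.
# Assumes tab-separated NCBI feature table lines and that each model
# starts with a "gene" feature line.
extract_gene_block() {
    awk -v tag="$1" -F '\t' '
        # Print the buffered model if it matched, then reset the buffer.
        function emit() {
            if (keep) printf "%s\n%s", header, block
            block = ""; keep = 0
        }
        /^>Feature/  { emit(); header = $0; next }   # new contig section
        $3 == "gene" { emit() }                      # a new model starts here
                     { block = block $0 "\n" }       # buffer the current line
        $4 == "locus_tag" && $5 == tag { keep = 1 }  # this is the model we want
        END          { emit() }
    ' "$2"
}
```

Usage would be along the lines of `extract_gene_block FUN_000647 path/to/archived/genome.tbl`. The printed block then has to be pasted under the matching `>Feature` header in the tbl produced by update, and the coordinates checked by hand.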

eyalbenda commented 4 years ago

Thank you for the help. It made me notice that funannotate was messing up the species name. I think it could be because the species, "Dunaliella bardawil", isn't in NCBI, while a sister species, "Dunaliella salina", is. The gbk file has Dunaliella salina everywhere, and after running funannotate fix the files get renamed to that species. I guess I can use sed to change it back, but it appears to be a bug. I specified the full species name, with the quote marks, to both the train and predict commands using the -s flag. I'm rerunning the update command for now and will update further.
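The sed idea could be sketched roughly as below. `rename_species` is a hypothetical helper and the file name in the usage note is a placeholder. It only rewrites text, so any later step that repeats the taxonomy lookup may switch the name back.

```shell
# Hypothetical helper: print the file given as $3 with every occurrence of
# the species name $1 replaced by $2. The original file is left untouched.
# Assumes neither name contains "/" or other sed metacharacters.
rename_species() {
    sed "s/$1/$2/g" "$3"
}
```

For example, `rename_species "Dunaliella salina" "Dunaliella bardawil" genome.gbk > genome.renamed.gbk` writes a corrected copy next to the original.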

eyalbenda commented 4 years ago

Update: the fix allowed funannotate update to run to completion. Please let me know if I should close the bug report or keep it open due to the naming issue.

nextgenusfs commented 4 years ago

Hmm, not sure exactly about the taxonomy issue -- the pipeline is running tbl2asn using taxonomy lookup, so I thought this would only grab the existing lineage and not necessarily change the genus species name. I'll have to look into the tbl2asn docs to see if this is their intended behavior or not.

Per the error above with the GT rich gene -- this should be addressed in the PASA code as it really isn't a funannotate bug.

hyphaltip commented 4 years ago

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=3046&lvl=3&lin=f&keep=1&srchmode=1&unloc

D. bardawil is a synonym in NCBI. That's why the names are switched.

Jason Stajich, PhD (replying by email to Jon Palmer's comment of Dec 23, 2019)

eyalbenda commented 4 years ago

I see. Thanks for the reply, I understand now. Happy holidays and thank you as always for the support!