nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
322 stars 85 forks source link

undeclared dependencies #158

Closed AnotherSimon closed 6 years ago

AnotherSimon commented 6 years ago

When running funannotate compare I get the following error message:

[Mar 26 03:39 PM]: Running funannotate v1.3.0-beta [Mar 26 03:39 PM]: Missing Dependencies: find_enrichment.py, mafft, raxmlHPC-PTHREADS, trimal, proteinortho5.pl, phyml. Please install missing dependencies and re-run script

Neither the install guide nor funannotate check mention these dependencies.

AnotherSimon commented 6 years ago

Ran into an error for funannotate compare --run_dnds full because pal2nal was not installed. Should this be bundled with one of the other dependencies or is this an oversight in the dependencies check?

nextgenusfs commented 6 years ago

Its bundled with funannotate https://github.com/nextgenusfs/funannotate/blob/master/util/pal2nal.pl. What was the error?

AnotherSimon commented 6 years ago

[05/09/18 16:15:43]: dNdS Error: pal2nal failed for funannotate_compare/orthology/orth4347.aln

However this error is not systematic, many orthNNN.aln files pass silently. There's more than 5000 ortho groups for this run but more than 3800 seem to fail.

AnotherSimon commented 6 years ago

Hmm, it seems I have quite a few gene models that do not start with ATG but the main error seems to be a mismatch between the protein and mRNA sequence which seems to happen regularly and only for one of the bugs being compared (in all cases I inspected so far). I went back to the annotate_results folder for this bug but Gene2Products.must-fix.txt is empty except for the header line.

$ pal2nal.pl funannotate_compare/orthology/orth3/orth3.aln funannotate_compare/orthology/orth3/orth3.transcripts.fa

--- ERROR: inconsistency between the following pep and nuc seqs ---

>MyBug1_003557-T1 ATSRWAKIPGARRGSSRTSNTQMKRISECWASWSRSLASLLLWKLTGCVSRWRRLLTAMP SFCSVATVQSCFRIAIPNRSKPSSRFCRFLFMACAFQLSGLGVLRGSTQSHAASRPRWSC LTARSAPSSPSAATMTVLVWMSASRTRKGSCLPTSTRLQRLTMFGAGSRAATPICMPRMP GVSTMCSRSTYSENLRALRTVSWIPLTLRQSELTLAGTPAMHSALWITLATKGCSNTKMC RATRNRGPTTCLPTQFGSATARASRTAPTSSSSGPSVTPSASKWVRPCLPRTSFGSWTLS TPTLKKGARFRVTARKRSTTYFRRTSVRRTRSTMVWSFGAAILCMAIRFHRRTRASRHAC LVTSSRSSRARFGSMPRRKVAWAACTSSQATRVQSASEEAWSFRRKISRGDTRTATLATM NSLWILRLSARCFVPSAKGPPPSAMPSTRRTPSTTVTKAHDLAVQCQGLHRMSNASIIKF LAQNFVGQGQ >MyBug1_003557-T1 GCAACGTCGAGATGGGCAAAGATTCCTGGCGCGAGAAGAGGATCGAGCAGGACGTCAAAT ACACAGATGAAGCGCATTTGAAGCGAGTGCTGGGCCAGCTGGAGTCGCTCCCTGGCCTCG TTACTCCTGTGGAAATTGACCGGCTGCGTGAGCAGATGGCGCAGGTTGCTGACGGCAATG CCTTCCTTCTGCAGTGTGGCGACTGTGCAGAGCTGTTTTCGTATTGCAATCCCAAACAGA TCGAAGCCAAGTTGAAGCTGACGCTTCTAATGTCGCTGATTCTTATTCATGGCGTGCGCC TTCCAGTTGTCCGGATTGGGCGTATTGCGGGGCAGTACGCAAAGCCACGCAGCAAGCCGA CCGAGGTGGTAGAGCTGCCTAACGGCGAGAAGCGCACCGTCCTCTCCTTCCGCGGCGACA ATGTGAACGGTCTTGGTCTGGATGAGCGCGAGCCGGACCCGGAAAGGCTCTTGTCTGCCT ACTTCCACTCGGCTGCAACGCTTAACTATGTTCGGAGCTGGATCTCGAGCGGCTACGCCG ATTTGCATGCCCCGCATGCCTGGAGTTTCCACCATGTGCAGTCGGAGTACCTACAGCGAG AATTTGCGCGCATTACGGACAGTATCGTGGATACCCTTGACTTTATGAAGACAGTCGGAG CTGACCCTGGCAGGCACGCCAGCAATGCACTCAGCACTGTGGATTACTTTGTGAGCCACG AAGGGCTGATGCTCGAATACGAAAATGTGCTGACGCGCGACACGGAACAGGGGCCCTACG ACCTGTCTGCCCACACAATTTGGCTCGGCGACCGCACGCGCCAGCCGGACGGCGCCCACG TCGAGTTCTTCCGGTCCATCCGTAACCCCATCGGCATCAAAGTGGGTCCGTCCATGTCTG CCGAGGACCTCATTCGGATCCTGGACATTGTCAACCCCGACCTTGAAAAAGGGCGCGTGA CGCTAATTTCGCGTTACGGCGCGAAAAAGATCAACGACCTACTTCCGGCGCACATCCGTG CGGTGAAGAACTCGCAGCACAATGGTCTGGTCATTTGGTGCTGCGATCCTATGCATGGCA ATACGATTTCATCGCCGCTGAACCCGAGCATCAAGACACGCATGTTTAGTGACATCGTCA AGGAGCTCACGAGCTCGCTTCGGATCCATGCCGAGGAGGAAAGTCGCTTGGGCGGCGTGC ACCTCGAGCTGACAGGCGACGAGGGTGTGACAGAGTGCATCGGAGGAAGCATGGAGCTTT CGGAGGAAGATCTCTCGAGGCGATACCTGACGTACTGCGACCCTCGCCTGAACTATGAAC AGTCTCTGGATATTGCGTTTATGATCAGCGAGGTGCTTCGTGCCCAGCGCAAAGGGACCG CCGCCGTCAGCGATGCCATCGTGAACGCGCTAGCGCACACCATCGACGACGGTGACGAAA GCTCATGATTTGGCAGTGCAATGCCAAGGATTGCACAGAATGTCCAATGCGAGTATAATT AAATTCTTAGCACAAAATTTTGTTGGACAAGGTCAGAC

Run bl2seq (-p tblastn) or GeneWise to see the inconsistency.

nextgenusfs commented 6 years ago

So does it seem there is an error in how the transcripts or protein sequences are being generated? or something wrong with these particular gene models? This is maybe something that needs some more attention as some gene models will have UTRs (if used RNA-seq), likely that is not being taken into consideration when generating the files and perhaps leading to the error? Are you able to check if the MyBug1_003557-T1 sequences match up with the gene model in the GenBank file?

AnotherSimon commented 6 years ago

I started rerunning the annotation for this particular bug (with RNAseq support) to rule out any issues carried over from previous versions of the pipeline, results still pending.

Using only the sequences posted above it is evident though that the protein sequence was derived by translating the mRNA sequence from the first base onward and not from the start codon. (Expasy site has a nice translation tool that returns all 6 frames .)

Is the pipeline supposed to chop of the UTRs before running pal2nal or does it simply copy from outDir/annotate_results/MyBug.transcripts.fa?

nextgenusfs commented 6 years ago

I will try to look at this today -- if my memory serves I wrote this section likely before I added support for RNA-seq and thus UTRs. So this likely needs to be updated to include only the transcript from the coding sequence.

nextgenusfs commented 6 years ago

So looks like the transcripts were being generated from full length mRNA's -- so UTRs were not being removed. I have added a UTR-stripped transcript to the parser, so testing it locally first, hopefully that fixes the dN/dS issue in pal2nal.

nextgenusfs commented 6 years ago

transcripts should be fixed here: https://github.com/nextgenusfs/funannotate/commit/8e2c9842cb871514006765c0e79297824de3574a

hyphaltip commented 6 years ago

Awesome to make sure this fix is applied in an already run annotation what should I remove to force a rerun of this step?

Jason

Jason E Stajich, PhD Professor and Director, Microbiology Graduate Program Department of Microbiology and Plant Pathology University of California, Riverside http://lab.stajich.org @stajichlab @hyphaltip @zygolife +1 951.827.2363

On May 18, 2018, 10:54 AM -0700, Jon Palmer notifications@github.com, wrote:

transcripts should be fixed here: 8e2c984 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

nextgenusfs commented 6 years ago

Just need to remove the protortho folder from compare output I think. That should re-generate the transcript files for each gene model that are later used in dN/dS.

nextgenusfs commented 6 years ago

Update --- I think also need to remove the orthology folder as well - as it will have results computed from previous run, I'm assuming you want to re-run them as the input data might have changed.

jolobito commented 6 years ago

Hi, I obtained a similar error. I see several pal2nal failed, but not in all ortho groups

dNdS Error: pal2nal failed for felines_compare_10k/orthology/orth4.aln

The problem seems to be a error with the transcript from one the genomes

--- ERROR: inconsistency between the following pep and nuc seqs ---

LWI_000631-T1 NKEKMSDLSESSILRKSQSNDTRVRSNFLYWTKPSSLYLITSTVSSSRSLGGAYSSTLIR RSSCSMDTAWVCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC LWI_000631-T1 AACAAAGAGTAGAAGATGTCCGACTTATCCGAGAGCAGCATCCTACGAAAATCCCAGTGA TCATAGAACGATACAAGGGTGAGAAGCAACTTCCTGTACTGGACAAAACCAAGTTCCTTG TACCTGATCACGTCAACATGAGTGAGCTCATCAAGATCATTAGGAGGCGCTTACAGCTCA ACGCTAATCAGGCGTTCTTCCTGTTAGTCAATGGACACAGCATGGTGAGTGTGTCCACGC CGATTTCTGAAGTGTATGAACGTGAGAAGGACGAAGACGGGTTCCTGTATATGGTATATG CCTCTCAGGAGACATTTGGAATCAAAGTGTCTGTGTAA

    ###-----   Check the following TBLASTN output:           -----###
    ###-----       your pep -> 'Query'                       -----###
    ###-----       your nuc -> 'Sbjct'                       -----###

Query= LWI_000631-T1 (106 letters)

LWI_000631-T1 Length = 338

Score = 193 bits (490), Expect = 5e-72 Identities = 105/112 (93%), Positives = 105/112 (93%), Gaps = 6/112 (5%) Frame = +1

Query: 1 NKE-KMSDLSESSILRKSQS--NDTRVRSNFLYWTKPSSLYLITST-VSSSRSLGGAYSS 56 NKE KMSDLSESSILRKSQ NDTRVRSNFLYWTKPSSLYLITST VSSSRSLGGAYSS Sbjct: 1 NKE*KMSDLSESSILRKSQ*S*NDTRVRSNFLYWTKPSSLYLITST*VSSSRSLGGAYSS 180

Query: 57 TLIRRSSC-SMDTAW-VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC 106 TLIRRSSC SMDTAW VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC Sbjct: 181 TLIRRSSCSMDTAWVCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC 336

The sequence translated from the transcript have a lot of stop codons and is different from the protein in paste /annotate_result

Protein translated in compare

LWI_000631-T1 LWI_000631 NKE*KMSDLSESSILRKSQ*S*NDTRVRSNFLYWTKPSSLYLITST*VSSSRSLGGAYSS TLIRRSSC*SMDTAW*VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC

Protein translated in annotate_result

LWI_000631-T1 LWI_000631 QRVEDVRLIREQHPTKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELIKIIRRRLQLNANQAFFLLVNGHSMVSVSTP ISEVYEREKDEDGFLYMVYASQETFGIKVSV

Also the transcript is not multiple of 3 and is in frame +3.

LWI_000631-T1 LWI_000631 AACAAAGAGTAGAAGATGTCCGACTTATCCGAGAGCAGCATCCTACGAAAATCCCAGTGATCATAGAACGATACAAGGGT GAGAAGCAACTTCCTGTACTGGACAAAACCAAGTTCCTTGTACCTGATCACGTCAACATGAGTGAGCTCATCAAGATCAT TAGGAGGCGCTTACAGCTCAACGCTAATCAGGCGTTCTTCCTGTTAGTCAATGGACACAGCATGGTGAGTGTGTCCACGC CGATTTCTGAAGTGTATGAACGTGAGAAGGACGAAGACGGGTTCCTGTATATGGTATATGCCTCTCAGGAGACATTTGGA ATCAAAGTGTCTGTGTAA

I am using the last version (1.5) including the fix cited above. I am not using RNA-seq information.

Thanks!