Closed AnotherSimon closed 6 years ago
Ran into an error for funannotate compare --run_dnds full
because pal2nal was not installed. Should this be bundled with one of the other dependencies or is this an oversight in the dependencies check?
Its bundled with funannotate https://github.com/nextgenusfs/funannotate/blob/master/util/pal2nal.pl. What was the error?
[05/09/18 16:15:43]: dNdS Error: pal2nal failed for funannotate_compare/orthology/orth4347.aln
However this error is not systematic, many orthNNN.aln files pass silently. There's more than 5000 ortho groups for this run but more than 3800 seem to fail.
Hmm, it seems I have quite a few gene models that do not start with ATG but the main error seems to be a mismatch between the protein and mRNA sequence which seems to happen regularly and only for one of the bugs being compared (in all cases I inspected so far). I went back to the annotate_results
folder for this bug but Gene2Products.must-fix.txt
is empty except for the header line.
$ pal2nal.pl funannotate_compare/orthology/orth3/orth3.aln funannotate_compare/orthology/orth3/orth3.transcripts.fa
--- ERROR: inconsistency between the following pep and nuc seqs ---
>MyBug1_003557-T1 ATSRWAKIPGARRGSSRTSNTQMKRISECWASWSRSLASLLLWKLTGCVSRWRRLLTAMP SFCSVATVQSCFRIAIPNRSKPSSRFCRFLFMACAFQLSGLGVLRGSTQSHAASRPRWSC LTARSAPSSPSAATMTVLVWMSASRTRKGSCLPTSTRLQRLTMFGAGSRAATPICMPRMP GVSTMCSRSTYSENLRALRTVSWIPLTLRQSELTLAGTPAMHSALWITLATKGCSNTKMC RATRNRGPTTCLPTQFGSATARASRTAPTSSSSGPSVTPSASKWVRPCLPRTSFGSWTLS TPTLKKGARFRVTARKRSTTYFRRTSVRRTRSTMVWSFGAAILCMAIRFHRRTRASRHAC LVTSSRSSRARFGSMPRRKVAWAACTSSQATRVQSASEEAWSFRRKISRGDTRTATLATM NSLWILRLSARCFVPSAKGPPPSAMPSTRRTPSTTVTKAHDLAVQCQGLHRMSNASIIKF LAQNFVGQGQ >MyBug1_003557-T1 GCAACGTCGAGATGGGCAAAGATTCCTGGCGCGAGAAGAGGATCGAGCAGGACGTCAAAT ACACAGATGAAGCGCATTTGAAGCGAGTGCTGGGCCAGCTGGAGTCGCTCCCTGGCCTCG TTACTCCTGTGGAAATTGACCGGCTGCGTGAGCAGATGGCGCAGGTTGCTGACGGCAATG CCTTCCTTCTGCAGTGTGGCGACTGTGCAGAGCTGTTTTCGTATTGCAATCCCAAACAGA TCGAAGCCAAGTTGAAGCTGACGCTTCTAATGTCGCTGATTCTTATTCATGGCGTGCGCC TTCCAGTTGTCCGGATTGGGCGTATTGCGGGGCAGTACGCAAAGCCACGCAGCAAGCCGA CCGAGGTGGTAGAGCTGCCTAACGGCGAGAAGCGCACCGTCCTCTCCTTCCGCGGCGACA ATGTGAACGGTCTTGGTCTGGATGAGCGCGAGCCGGACCCGGAAAGGCTCTTGTCTGCCT ACTTCCACTCGGCTGCAACGCTTAACTATGTTCGGAGCTGGATCTCGAGCGGCTACGCCG ATTTGCATGCCCCGCATGCCTGGAGTTTCCACCATGTGCAGTCGGAGTACCTACAGCGAG AATTTGCGCGCATTACGGACAGTATCGTGGATACCCTTGACTTTATGAAGACAGTCGGAG CTGACCCTGGCAGGCACGCCAGCAATGCACTCAGCACTGTGGATTACTTTGTGAGCCACG AAGGGCTGATGCTCGAATACGAAAATGTGCTGACGCGCGACACGGAACAGGGGCCCTACG ACCTGTCTGCCCACACAATTTGGCTCGGCGACCGCACGCGCCAGCCGGACGGCGCCCACG TCGAGTTCTTCCGGTCCATCCGTAACCCCATCGGCATCAAAGTGGGTCCGTCCATGTCTG CCGAGGACCTCATTCGGATCCTGGACATTGTCAACCCCGACCTTGAAAAAGGGCGCGTGA CGCTAATTTCGCGTTACGGCGCGAAAAAGATCAACGACCTACTTCCGGCGCACATCCGTG CGGTGAAGAACTCGCAGCACAATGGTCTGGTCATTTGGTGCTGCGATCCTATGCATGGCA ATACGATTTCATCGCCGCTGAACCCGAGCATCAAGACACGCATGTTTAGTGACATCGTCA AGGAGCTCACGAGCTCGCTTCGGATCCATGCCGAGGAGGAAAGTCGCTTGGGCGGCGTGC ACCTCGAGCTGACAGGCGACGAGGGTGTGACAGAGTGCATCGGAGGAAGCATGGAGCTTT CGGAGGAAGATCTCTCGAGGCGATACCTGACGTACTGCGACCCTCGCCTGAACTATGAAC AGTCTCTGGATATTGCGTTTATGATCAGCGAGGTGCTTCGTGCCCAGCGCAAAGGGACCG CCGCCGTCAGCGATGCCATCGTGAACGCGCTAGCGCACACCATCGACGACGGTGACGAAA GCTCATGATTTGGCAGTGCAATGCCAAGGATTGCACAGAATGTCCAATGCGAGTATAATT AAATTCTTAGCACAAAATTTTGTTGGACAAGGTCAGAC
Run bl2seq (-p tblastn) or GeneWise to see the inconsistency.
So does it seem there is an error in how the transcripts or protein sequences are being generated? or something wrong with these particular gene models? This is maybe something that needs some more attention as some gene models will have UTRs (if used RNA-seq), likely that is not being taken into consideration when generating the files and perhaps leading to the error? Are you able to check if the MyBug1_003557-T1
sequences match up with the gene model in the GenBank file?
I started rerunning the annotation for this particular bug (with RNAseq support) to rule out any issues carried over from previous versions of the pipeline, results still pending.
Using only the sequences posted above it is evident though that the protein sequence was derived by translating the mRNA sequence from the first base onward and not from the start codon. (Expasy site has a nice translation tool that returns all 6 frames .)
Is the pipeline supposed to chop of the UTRs before running pal2nal or does it simply copy from outDir/annotate_results/MyBug.transcripts.fa
?
I will try to look at this today -- if my memory serves I wrote this section likely before I added support for RNA-seq and thus UTRs. So this likely needs to be updated to include only the transcript from the coding sequence.
So looks like the transcripts were being generated from full length mRNA's -- so UTRs were not being removed. I have added a UTR-stripped transcript to the parser, so testing it locally first, hopefully that fixes the dN/dS issue in pal2nal.
transcripts should be fixed here: https://github.com/nextgenusfs/funannotate/commit/8e2c9842cb871514006765c0e79297824de3574a
Awesome to make sure this fix is applied in an already run annotation what should I remove to force a rerun of this step?
Jason
Jason E Stajich, PhD Professor and Director, Microbiology Graduate Program Department of Microbiology and Plant Pathology University of California, Riverside http://lab.stajich.org @stajichlab @hyphaltip @zygolife +1 951.827.2363
On May 18, 2018, 10:54 AM -0700, Jon Palmer notifications@github.com, wrote:
transcripts should be fixed here: 8e2c984 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Just need to remove the protortho folder from compare output I think. That should re-generate the transcript files for each gene model that are later used in dN/dS.
Update --- I think also need to remove the orthology
folder as well - as it will have results computed from previous run, I'm assuming you want to re-run them as the input data might have changed.
Hi, I obtained a similar error. I see several pal2nal failed, but not in all ortho groups
dNdS Error: pal2nal failed for felines_compare_10k/orthology/orth4.aln
The problem seems to be a error with the transcript from one the genomes
--- ERROR: inconsistency between the following pep and nuc seqs ---
LWI_000631-T1 NKEKMSDLSESSILRKSQSNDTRVRSNFLYWTKPSSLYLITSTVSSSRSLGGAYSSTLIR RSSCSMDTAWVCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC LWI_000631-T1 AACAAAGAGTAGAAGATGTCCGACTTATCCGAGAGCAGCATCCTACGAAAATCCCAGTGA TCATAGAACGATACAAGGGTGAGAAGCAACTTCCTGTACTGGACAAAACCAAGTTCCTTG TACCTGATCACGTCAACATGAGTGAGCTCATCAAGATCATTAGGAGGCGCTTACAGCTCA ACGCTAATCAGGCGTTCTTCCTGTTAGTCAATGGACACAGCATGGTGAGTGTGTCCACGC CGATTTCTGAAGTGTATGAACGTGAGAAGGACGAAGACGGGTTCCTGTATATGGTATATG CCTCTCAGGAGACATTTGGAATCAAAGTGTCTGTGTAA
###----- Check the following TBLASTN output: -----### ###----- your pep -> 'Query' -----### ###----- your nuc -> 'Sbjct' -----###
Query= LWI_000631-T1 (106 letters)
LWI_000631-T1 Length = 338
Score = 193 bits (490), Expect = 5e-72 Identities = 105/112 (93%), Positives = 105/112 (93%), Gaps = 6/112 (5%) Frame = +1
Query: 1 NKE-KMSDLSESSILRKSQS--NDTRVRSNFLYWTKPSSLYLITST-VSSSRSLGGAYSS 56 NKE KMSDLSESSILRKSQ NDTRVRSNFLYWTKPSSLYLITST VSSSRSLGGAYSS Sbjct: 1 NKE*KMSDLSESSILRKSQ*S*NDTRVRSNFLYWTKPSSLYLITST*VSSSRSLGGAYSS 180
Query: 57 TLIRRSSC-SMDTAW-VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC 106 TLIRRSSC SMDTAW VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC Sbjct: 181 TLIRRSSCSMDTAWVCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC 336
The sequence translated from the transcript have a lot of stop codons and is different from the protein in paste /annotate_result
Protein translated in compare
LWI_000631-T1 LWI_000631 NKE*KMSDLSESSILRKSQ*S*NDTRVRSNFLYWTKPSSLYLITST*VSSSRSLGGAYSS TLIRRSSC*SMDTAW*VCPRRFLKCMNVRRTKTGSCIWYMPLRRHLESKCLC
Protein translated in annotate_result
LWI_000631-T1 LWI_000631 QRVEDVRLIREQHPTKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELIKIIRRRLQLNANQAFFLLVNGHSMVSVSTP ISEVYEREKDEDGFLYMVYASQETFGIKVSV
Also the transcript is not multiple of 3 and is in frame +3.
LWI_000631-T1 LWI_000631 AACAAAGAGTAGAAGATGTCCGACTTATCCGAGAGCAGCATCCTACGAAAATCCCAGTGATCATAGAACGATACAAGGGT GAGAAGCAACTTCCTGTACTGGACAAAACCAAGTTCCTTGTACCTGATCACGTCAACATGAGTGAGCTCATCAAGATCAT TAGGAGGCGCTTACAGCTCAACGCTAATCAGGCGTTCTTCCTGTTAGTCAATGGACACAGCATGGTGAGTGTGTCCACGC CGATTTCTGAAGTGTATGAACGTGAGAAGGACGAAGACGGGTTCCTGTATATGGTATATGCCTCTCAGGAGACATTTGGA ATCAAAGTGTCTGTGTAA
I am using the last version (1.5) including the fix cited above. I am not using RNA-seq information.
Thanks!
When running
funannotate compare
I get the following error message:Neither the install guide nor
funannotate check
mention these dependencies.