Open fmobegi opened 2 years ago
Are the EC numbers coming from eggnog, or some other source? I wonder if the EC database has unified gene names and product deflines. There are a few tools that can add EC numbers; the problem with scraping them from eggnog is that they aren't really assigned and/or filtered via some e-value/confidence threshold.
I figured your data for allocating products come from InterProScan; ironically, the EC numbers + product names don't always match whatever you get from BRENDA or ExPASy (wget https://ftp.expasy.org/databases/enzyme/enzyme.dat).
I will try some brute force to reassign product names for all annotations.
A good example is product=Ubiquitin-conjugating enzyme E2 1;Dbxref=InterPro:IPR023313,PFAM:PF00179,InterPro:IPR000608;EC_number=1.3.8.6;note=COG:E,EggNog:ENOG503NW16;
That EC corresponds with EC 1.3.8.6 - glutaryl-CoA dehydrogenase (ETF), which makes the assigned annotation wrong.
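To check a mismatch like this one, the EC number can be looked up in the ExPASy enzyme.dat file mentioned above. The following is a minimal sketch, assuming the flat-file layout in which each record has an `ID` line with the EC number, one or more `DE` lines with the description, and a closing `//` line. The function name `parse_enzyme_dat` and the hand-written sample record are illustrative only and are not part of funannotate.

```python
def parse_enzyme_dat(lines):
    """Map EC number -> description from ENZYME flat-file lines.

    Assumes 'ID   <ec>' starts a record, 'DE   <text>' lines carry the
    description, and '//' terminates the record.
    """
    names = {}
    ec, parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("ID   "):
            ec, parts = line[5:].strip(), []
        elif line.startswith("DE   ") and ec is not None:
            parts.append(line[5:].strip())
        elif line.startswith("//"):
            if ec is not None:
                # Join continuation lines and drop the trailing period.
                names[ec] = " ".join(parts).rstrip(".")
            ec, parts = None, []
    return names


# Hand-written sample record in the assumed layout (not copied from the file).
sample = [
    "ID   1.3.8.6",
    "DE   glutaryl-CoA dehydrogenase (ETF).",
    "//",
]
ec_names = parse_enzyme_dat(sample)
print(ec_names["1.3.8.6"])
```

With the real file, `parse_enzyme_dat(open("enzyme.dat"))` would give a dictionary that can be compared against the `product=` value in the GFF.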
Or the EC annotation is wrong. I think it's eggnog but I can double check. At work I use ECPred to do this - it's quite slow but at least there is some confidence in the data.
I will try running ECPred and then compare the annotations with IPRscan.
Currently in funannotate the EC numbers only come from eggnog. So yes, it would be interesting to see how all three of those compare.
Agreed, I get these errors too - I think we perhaps need to not trust eggnog as much. I'm wondering if IPR annotations are more often consistent, so I'll be curious what you find @fmobegi.
The issues with eggnog product names vs NCBI expected names do cause some manual checking needed. Have not tried ECPred but maybe we need to on my end too.
Finished the ECPred analysis. See the attached results. Some of the EC numbers match, a few are off. As for the description, most of the "product names" assigned by funannotate tend to be completely different from the name associated with the EC# (they are probably InterPro descriptions).
ME14-ecs.txt
Thanks @fmobegi. Generally there is significant overlap. You would gain a lot more EC numbers using ECPred -- note the dashes are valid in NCBI for EC numbers (I think), so you could reformat and pass this to the custom annotations if you wanted to incorporate.
$ cat ME14-ecs.txt | grep -v 'non Enzyme' | grep -v 'no Prediction' | head -n 50
Protein Gene ECPred ConfidenceScore(max=1.0) Funannotate_EC
EKO05_000001-T1 EKO05_000001 2.7.11.1 0.79 2.7.11.1
EKO05_000002-T1 EKO05_000002 1.5.3.- 1 1.5.3
EKO05_000003-T2 EKO05_000003 1.14.-.- 0.87
EKO05_000003-T1 EKO05_000003 1.14.12.- 0.83
EKO05_000004-T1 EKO05_000004 1.14.13.- 0.77
EKO05_000005-T1 EKO05_000005 1.-.-.- 0.5
EKO05_000008-T1 EKO05_000008 3.2.1.- 0.68
EKO05_000010-T1 EKO05_000010 1.-.-.- 0.57
EKO05_000011-T1 EKO05_000011 3.1.3.- 0.7
EKO05_000014-T1 EKO05_000014 3.1.1.3 0.74 3.1.1.20
EKO05_000016-T1 EKO05_000016 1.10.3.2 0.91 1.10.3.3
EKO05_000017-T1 EKO05_000017 3.2.1.39 0.98 3.2.1.58
EKO05_000018-T1 EKO05_000018 3.-.-.- 0.45
EKO05_000019-T1 EKO05_000019 3.-.-.- 0.53
EKO05_000020-T1 EKO05_000020 3.8.-.- 0.61 3.3.2.9
EKO05_000022-T1 EKO05_000022 3.1.3.- 0.93 3.1.3.37
EKO05_000027-T1 EKO05_000027 2.-.-.- 0.47
EKO05_000029-T1 EKO05_000029 2.3.1.- 0.63
EKO05_000033-T1 EKO05_000033 1.15.1.1 1 1.15.1.1
EKO05_000034-T1 EKO05_000034 2.2.1.2 1 2.2.1.2
EKO05_000035-T1 EKO05_000035 1.1.1.- 0.78
EKO05_000036-T1 EKO05_000036 2.3.1.51 0.99 2.3.1.51
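The reformatting suggested above could look roughly like this. It is only a sketch: it assumes the whitespace-separated ECPred summary shown here, and it assumes the custom annotations file is a three-column tab-separated table (transcript ID, attribute name, value), which should be checked against the funannotate documentation. The function name `ecpred_to_custom` and the `min_score` cutoff are hypothetical choices, not values recommended anywhere in this thread.

```python
import re

# An EC number with four levels, where any level after the first may be a dash.
EC_PATTERN = re.compile(r"\d+(?:\.(?:\d+|-)){3}")


def ecpred_to_custom(rows, min_score=0.7):
    """Turn ECPred summary rows into 'ID<TAB>EC_number<TAB>value' lines."""
    out = []
    for row in rows:
        fields = row.split()
        if len(fields) < 4:
            continue
        protein, _gene, ec, score = fields[:4]
        if not EC_PATTERN.fullmatch(ec):
            continue  # header line, 'non Enzyme', 'no Prediction', ...
        try:
            confidence = float(score)
        except ValueError:
            continue
        if confidence >= min_score:
            out.append(f"{protein}\tEC_number\t{ec}")
    return out


rows = [
    "Protein Gene ECPred ConfidenceScore(max=1.0) Funannotate_EC",
    "EKO05_000001-T1 EKO05_000001 2.7.11.1 0.79 2.7.11.1",
    "EKO05_000005-T1 EKO05_000005 1.-.-.- 0.5",
    "EKO05_000022-T1 EKO05_000022 3.1.3.- 0.93 3.1.3.37",
]
custom = ecpred_to_custom(rows)
print("\n".join(custom))
```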
Generally I think these types of assignments are potentially problematic in the funannotate output (which is just pulling from EggNog results). In these cases ECPred was able to assign a general family but not all the way down to four digits, meaning a lower-confidence hit. But since EggNog seems to report the full four-digit number if it is present, funannotate is probably assigning an EC function that is not validated.
EKO05_000020-T1 EKO05_000020 3.8.-.- 0.61 3.3.2.9
EKO05_000022-T1 EKO05_000022 3.1.3.- 0.93 3.1.3.37
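One way to separate the two cases above (a real conflict versus a less specific but compatible call) is to treat each dash as a wildcard and compare the numbers level by level. A minimal sketch; `ec_consistent` is a hypothetical helper, not something ECPred or funannotate provides:

```python
def ec_consistent(partial, full):
    """Return True if the two EC numbers agree at every level both define.

    A '-' at any level is treated as "unknown" and matches anything;
    a shorter number such as '1.5.3' is compared only on the levels it has.
    """
    for a, b in zip(partial.split("."), full.split(".")):
        if a == "-" or b == "-":
            continue
        if a != b:
            return False
    return True


# Rows taken from the table above: (ECPred, Funannotate_EC)
pairs = [("3.8.-.-", "3.3.2.9"), ("3.1.3.-", "3.1.3.37"), ("1.5.3.-", "1.5.3")]
for ecpred, funannotate_ec in pairs:
    print(ecpred, funannotate_ec, ec_consistent(ecpred, funannotate_ec))
```

Under this check EKO05_000020 is a genuine disagreement at the second level, while EKO05_000022 is compatible: EggNog simply reports more digits than ECPred was willing to commit to.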
Do you have some examples where the product deflines are "completely different"? This result is probably expected -- and it doesn't mean it is necessarily wrong.
Perhaps we should just not pull the EC numbers from EggNog. Or alternatively we add a parameter like --strict that would strip these (and potentially other non-compliant annotations).
I think it's fine using a strict filter that will exclude the EC# altogether if the description is "hypothetical" or the EC is non-definitive. Where the EC# is determined to 4 digits, we modified the description to that provided by ExPASy.
We whipped up a simple script ( https://github.com/JWDebler/bioinformatics/blob/master/parse_EC_number_after_funannotate.py ) to do just that on the final GFF file.
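For illustration, the strict filter described here could be sketched as below. This is not the linked script and not a funannotate option; it is a simplified stand-in that works on the attribute column (column 9) of a single GFF3 line, and the name `strip_unsupported_ec` and the list of placeholder product names are assumptions.

```python
import re

FULL_EC = re.compile(r"\d+\.\d+\.\d+\.\d+")
# Product names treated as "no real name" (assumed list).
PLACEHOLDER_PRODUCTS = {"", "hypothetical protein", "unknown protein"}


def strip_unsupported_ec(attributes):
    """Drop EC_number from a GFF3 attribute string when it is not supported.

    The EC number is removed if the product is a placeholder name or if any
    listed EC number is not fully resolved to four numeric levels.
    """
    fields = [f for f in attributes.strip().split(";") if f]
    product, ec = "", None
    for field in fields:
        key, _, value = field.partition("=")
        if key == "product":
            product = value
        elif key == "EC_number":
            ec = value
    if ec is None:
        return attributes
    definitive = all(FULL_EC.fullmatch(e) for e in ec.split(","))
    if product.lower() in PLACEHOLDER_PRODUCTS or not definitive:
        fields = [f for f in fields if not f.startswith("EC_number=")]
    return ";".join(fields) + ";"


attrs = ("ID=EKO05_000002-T1;Parent=EKO05_000002;product=hypothetical protein;"
         "EC_number=1.5.3;note=EggNog:ENOG503NUZZ,COG:E;")
print(strip_unsupported_ec(attrs))
```

Renaming the product from the four-digit EC number, as described above, would be a separate step that needs a lookup table built from the ExPASy file.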
Kind regards,
Dr Fredrick Mobegi, Bioinformatician (Centre for Crop and Disease Management)
"Assuredly we bring not innocence into the world, we bring impurity much rather: that which purifies us is trial, and trial is by what is contrary." John Milton (1608-1674)
Hey Jon, Just a quick update on this issue. After fixing the product names, we came across another related issue.
Under which feature type in column 3 (gene, mRNA, *UTR, exon, CDS) should the product description be placed? At the moment, funannotate places this information on the mRNA feature. Should it also be added to the CDS features?
##gff-version 3
ArME14_ctg_01 funannotate gene 28584 30375 . - . ID=EKO05_000001;
ArME14_ctg_01 funannotate mRNA 28584 30375 . - . ID=EKO05_000001-T1;Parent=EKO05_000001;product=hypothetical protein;Ontology_term=GO:0004672,GO:0005524,GO:0006468;Dbxref=InterPro:IPR000719,InterPro:IPR017441,PFAM:PF00069;EC_number=2.7.11.1;note=EggNog:ENOG503P0IR,COG:T;
ArME14_ctg_01 funannotate five_prime_UTR 30094 30375 . - . ID=EKO05_000001-T1.utr5p1;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate exon 29885 30375 . - . ID=EKO05_000001-T1.exon1;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate exon 29329 29831 . - . ID=EKO05_000001-T1.exon2;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate exon 28584 29279 . - . ID=EKO05_000001-T1.exon3;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate three_prime_UTR 28584 28659 . - . ID=EKO05_000001-T1.utr3p1;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate CDS 29885 30093 . - 0 ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate CDS 29329 29831 . - 1 ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate CDS 28660 29279 . - 2 ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01 funannotate gene 73133 74748 . - . ID=EKO05_000002;
ArME14_ctg_01 funannotate mRNA 73133 74748 . - . ID=EKO05_000002-T1;Parent=EKO05_000002;product=hypothetical protein;Ontology_term=GO:0055114,GO:0016491;Dbxref=PFAM:PF01266,InterPro:IPR006076;EC_number=1.5.3;note=EggNog:ENOG503NUZZ,COG:E;
ArME14_ctg_01 funannotate five_prime_UTR 74585 74748 . - . ID=EKO05_000002-T1.utr5p1;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate exon 73458 74748 . - . ID=EKO05_000002-T1.exon1;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate exon 73133 73395 . - . ID=EKO05_000002-T1.exon2;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate three_prime_UTR 73133 73202 . - . ID=EKO05_000002-T1.utr3p1;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate CDS 73458 74584 . - 0 ID=EKO05_000002-T1.cds;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate CDS 73203 73395 . - 1 ID=EKO05_000002-T1.cds;Parent=EKO05_000002-T1;
ArME14_ctg_01 funannotate gene 76053 77838 . + . ID=EKO05_000003;
The annotation SQN file arising from such a GFF normally has the message PROTEIN_NAMES: All proteins have the same name "hypothetical protein", meaning the names are not transferred to the NCBI submission file.
Hi, just wanted to check in if any decision has been made regarding EC numbers and 'hypothetical protein' products. Currently the output files still have both in them and NCBI will complain about that.
Hi,
Sorry to prompt again, but has there been any update on this issue? The NCBI have just rejected a submission of mine with the following message: "[2] There are a lot of hypothetical proteins that have EC numbers. We expect that a protein characterized enough to be given an EC number should have a product name. Please change the name or remove the EC number" FATAL! 859 protein features have an EC number and a protein name of 'unknown protein' or 'hypothetical protein'
Fredrick, thank you for the script, but how does one go about using it? If it's on the GFF file, do I use it and then re-run funannotate annotate directing the --gff flag to the new gff3, or do I need to run table2asn?
Many thanks in advance
Best
Zack
@zacksaud I wrote a script in the meantime (mentioned above) that parses the funannotate GFF, looks for EC numbers, and then pulls the corresponding enzyme name from the Expasy database. I managed to get my annotations accepted by NCBI that way. I just updated the aforementioned script, as some encoding has changed in the Expasy database file. I just ran it over a recent funannotate GFF and it works.
https://github.com/JWDebler/bioinformatics/blob/master/parse_EC_number_after_funannotate.py
It needs the beautifulsoup4 package to run.
You just run the script using the funannotate gff as input:
python parse_EC_number_after_funannotate.py -i funannotate.gff > funannotate.fixed.gff
And then rerun table2asn with the fixed gff.
Cheers, Johannes
@JWDebler that's amazing, thank you!
@zacksaud no worries. It's not ideal, but it solves the problem you have. Preferably something similar will get rolled into funannotate in the future, including maybe better EC allocation; as mentioned above, EggNOG might not be the best source.
It seems like the NCBI doesn't like hypothetical proteins with EC numbers anymore. They are unhappy with such annotations and are asking for the EC# to be removed.
ArME14_ctg_01 funannotate mRNA 269174 271364 . + . ID=EKO05_000044-T1;Parent=EKO05_000044;product=hypothetical protein;Ontology_term=GO:0046556,GO:0031221,GO:0046373,GO:0019566;Dbxref=InterPro:IPR015289,PFAM:PF09206,InterPro:IPR038964,InterPro:IPR007934,PFAM:PF05270;EC_number=3.2.1.55;note=COG:G,EggNog:ENOG503NY8Q,CAZy:GH54,CAZy:CBM42,SECRETED:SignalP(1-34,SECRETED:cutsite=VAA-AP,SECRETED:prob=0.5984);
Could be something to consider in future issues.