nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

hypothetical proteins that have EC numbers #643

Open fmobegi opened 2 years ago

fmobegi commented 2 years ago

It seems like the NCBI doesn't like hypothetical proteins with EC numbers anymore. They are unhappy with such annotations and are asking for the EC# to be removed.

ArME14_ctg_01 funannotate mRNA 269174 271364 . + . ID=EKO05_000044-T1;Parent=EKO05_000044;product=hypothetical protein;Ontology_term=GO:0046556,GO:0031221,GO:0046373,GO:0019566;Dbxref=InterPro:IPR015289,PFAM:PF09206,InterPro:IPR038964,InterPro:IPR007934,PFAM:PF05270;EC_number=3.2.1.55;note=COG:G,EggNog:ENOG503NY8Q,CAZy:GH54,CAZy:CBM42,SECRETED:SignalP(1-34,SECRETED:cutsite=VAA-AP,SECRETED:prob=0.5984);

Could be something to consider in future issues.
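To see how many features are affected before submitting, a short script can scan the GFF3 for this combination. The following is only a sketch: the function names are made up for illustration, and it assumes a tab-separated GFF3 with `product` and `EC_number` attributes in column 9, as in the line above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, keeping values verbatim."""
    attrs = {}
    for field in column9.strip().strip(";").split(";"):
        if "=" in field:
            key, value = field.split("=", 1)  # values may themselves contain '='
            attrs[key] = value
    return attrs


def hypothetical_with_ec(gff_lines):
    """Yield (feature ID, EC number) for hypothetical/unknown proteins that carry an EC_number."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        product = attrs.get("product", "").lower()
        if "EC_number" in attrs and product in ("hypothetical protein", "unknown protein"):
            yield attrs.get("ID", ""), attrs["EC_number"]


# Illustrative record modelled on the mRNA line above (attributes shortened).
example = "\t".join([
    "ArME14_ctg_01", "funannotate", "mRNA", "269174", "271364", ".", "+", ".",
    "ID=EKO05_000044-T1;Parent=EKO05_000044;product=hypothetical protein;EC_number=3.2.1.55;",
])
for feature_id, ec in hypothetical_with_ec([example]):
    print(feature_id, ec)
```

On a real file, passing an open file handle instead of the list gives the set of features NCBI is likely to flag.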

nextgenusfs commented 2 years ago

Are the EC numbers coming from eggnog, or some other source? I wonder if the EC database has unified gene names and product deflines. There are a few tools that can add EC numbers; the problem with scraping them from eggnog is that they aren't really assigned and/or filtered via some e-value/confidence threshold.

fmobegi commented 2 years ago

I figured your data for allocating products comes from InterProScan; ironically, the EC numbers and product names don't always match what you get from BRENDA or ExPASy (wget https://ftp.expasy.org/databases/enzyme/enzyme.dat). I will try some brute force to reassign product names for all annotations. A good example is product=Ubiquitin-conjugating enzyme E2 1;Dbxref=InterPro:IPR023313,PFAM:PF00179,InterPro:IPR000608;EC_number=1.3.8.6;note=COG:E,EggNog:ENOG503NW16; That EC corresponds to EC 1.3.8.6 - glutaryl-CoA dehydrogenase (ETF), which makes the assigned annotation wrong.
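For reference, the EC-to-name lookup from enzyme.dat could be built roughly like this. It is a minimal sketch, assuming the usual ENZYME flat-file layout (a two-letter line code, `ID` for the EC number, `DE` for the description, `//` ending each entry); `parse_enzyme_dat` is a made-up name, not part of any existing tool.

```python
def parse_enzyme_dat(lines):
    """Map EC number -> description using the ID and DE lines of an ENZYME flat file."""
    names = {}
    ec, parts = None, []
    for line in lines:
        code, text = line[:2], line[5:].strip()
        if code == "ID":
            ec, parts = text, []
        elif code == "DE" and ec is not None:
            parts.append(text)  # a description can continue over several DE lines
        elif code == "//":
            if ec is not None:
                # Wrapped descriptions are joined with a space, which is only an
                # approximation when a name was broken in the middle of a word.
                names[ec] = " ".join(parts).rstrip(".")
            ec, parts = None, []
    return names


# Small in-memory sample; the description is the one quoted in this comment.
sample = [
    "ID   1.3.8.6",
    "DE   glutaryl-CoA dehydrogenase (ETF).",
    "//",
]
print(parse_enzyme_dat(sample))
```

Transferred or deleted entries also have a DE line, so those would need filtering before being used as product names.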

nextgenusfs commented 2 years ago

Or the EC annotation is wrong. I think it's eggnog but I can double check. At work I use ECPred to do this - it's quite slow but at least there is some confidence in the data.

fmobegi commented 2 years ago

I will try running ECPred and then compare the annotations with IPRscan.

nextgenusfs commented 2 years ago

Currently in funannotate the EC numbers only come from eggnog. So yes, it would be interesting to see how all three of those compare.

hyphaltip commented 2 years ago

Agreed, I get these errors too. I think we perhaps need to not trust eggnog as much. I'm wondering if the IPR annotations are more often consistent, so I'll be curious what you find @fmobegi.

The mismatches between eggnog product names and the names NCBI expects do require some manual checking. I have not tried ECPred, but maybe I need to on my end too.

fmobegi commented 2 years ago

Finished the ECPred analysis. See the attached results. Some of the EC numbers match; a few are off. As for the descriptions, most of the "product names" assigned by funannotate tend to be completely different from the name of the associated EC# (they are probably InterPro descriptions). ME14-ecs.txt

nextgenusfs commented 2 years ago

Thanks @fmobegi. Generally there is significant overlap. You would gain a lot more EC numbers using ECPred -- note the dashes are valid in NCBI for EC numbers (I think), so you could reformat and pass this to the custom annotations if you wanted to incorporate.

$ cat ME14-ecs.txt | grep -v 'non Enzyme' | grep -v 'no Prediction' | head -n 50
Protein          Gene          ECPred     ConfidenceScore(max=1.0)  Funannotate_EC
EKO05_000001-T1  EKO05_000001  2.7.11.1   0.79                      2.7.11.1
EKO05_000002-T1  EKO05_000002  1.5.3.-    1                         1.5.3
EKO05_000003-T2  EKO05_000003  1.14.-.-   0.87
EKO05_000003-T1  EKO05_000003  1.14.12.-  0.83
EKO05_000004-T1  EKO05_000004  1.14.13.-  0.77
EKO05_000005-T1  EKO05_000005  1.-.-.-    0.5
EKO05_000008-T1  EKO05_000008  3.2.1.-    0.68
EKO05_000010-T1  EKO05_000010  1.-.-.-    0.57
EKO05_000011-T1  EKO05_000011  3.1.3.-    0.7
EKO05_000014-T1  EKO05_000014  3.1.1.3    0.74                      3.1.1.20
EKO05_000016-T1  EKO05_000016  1.10.3.2   0.91                      1.10.3.3
EKO05_000017-T1  EKO05_000017  3.2.1.39   0.98                      3.2.1.58
EKO05_000018-T1  EKO05_000018  3.-.-.-    0.45
EKO05_000019-T1  EKO05_000019  3.-.-.-    0.53
EKO05_000020-T1  EKO05_000020  3.8.-.-    0.61                      3.3.2.9
EKO05_000022-T1  EKO05_000022  3.1.3.-    0.93                      3.1.3.37
EKO05_000027-T1  EKO05_000027  2.-.-.-    0.47
EKO05_000029-T1  EKO05_000029  2.3.1.-    0.63
EKO05_000033-T1  EKO05_000033  1.15.1.1   1                         1.15.1.1
EKO05_000034-T1  EKO05_000034  2.2.1.2    1                         2.2.1.2
EKO05_000035-T1  EKO05_000035  1.1.1.-    0.78
EKO05_000036-T1  EKO05_000036  2.3.1.51   0.99                      2.3.1.51

Generally I think these types of assignments are potentially problematic in the funannotate output (which is just pulling from the EggNog results). In these cases ECPred was able to assign a general family but not all the way down to four digits, meaning a lower-confidence hit. But since EggNog seems to report the full 4-digit EC if it is present, it is probably assigning an EC function that is not validated.

EKO05_000020-T1 EKO05_000020    3.8.-.- 0.61    3.3.2.9
EKO05_000022-T1 EKO05_000022    3.1.3.- 0.93    3.1.3.37
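One way to separate "EggNog is just more specific" from "the two sources actually disagree" is to compare only the EC levels that both predictions resolve. This is a rough sketch with hypothetical helper names, and it treats `-` or a missing level as unresolved:

```python
def ec_levels(ec):
    """Return the resolved leading levels of an EC number, stopping at the first '-' or gap."""
    levels = []
    for part in ec.strip().split("."):
        if part in ("", "-"):
            break
        levels.append(part)
    return levels


def ec_consistent(ec_a, ec_b):
    """True if both ECs resolve at least one level and agree on every level they share."""
    a, b = ec_levels(ec_a), ec_levels(ec_b)
    shared = min(len(a), len(b))
    return shared > 0 and a[:shared] == b[:shared]


# The two rows quoted above: ECPred prediction vs. the EC pulled from EggNog.
print(ec_consistent("3.1.3.-", "3.1.3.37"))  # agree down to the third level
print(ec_consistent("3.8.-.-", "3.3.2.9"))   # differ at the second level
```

By this measure the second row is the milder case (same sub-subclass, EggNog adds a serial number), while the first row is a disagreement at the subclass level.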

Do you have some examples where the product deflines are "completely different"? This result is probably expected -- and it doesn't mean it is necessarily wrong.

Perhaps we should just not pull the EC numbers from EggNog. Or alternatively we add a parameter like --strict that would strip these (and potentially other non-compliant annotations).

fmobegi commented 2 years ago

I think it's fine to use a strict filter that excludes the EC# altogether if the description is "hypothetical" or the EC is non-definitive. Where the EC# is determined to 4 digits, we changed the description to the one provided by ExPASy. We whipped up a simple script (https://github.com/JWDebler/bioinformatics/blob/master/parse_EC_number_after_funannotate.py) to do just that on the final GFF file.
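The rule described here could look roughly like the following when applied to one GFF3 attribute column. This is a sketch of the idea only, not the linked script and not an existing funannotate option; the function names are hypothetical, and `ec_names` stands for an EC-to-name dict built from the ExPASy file.

```python
import re

# Four fully numeric levels, e.g. 2.7.11.1; anything else (partial ECs such as
# 1.5.3, or comma-separated lists) is treated as non-definitive in this sketch.
FOUR_LEVEL_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")
HYPOTHETICAL = {"hypothetical protein", "unknown protein"}


def gff3_escape(value):
    """Percent-encode characters that are reserved inside GFF3 attribute values."""
    for char, code in (("%", "%25"), (";", "%3B"), ("=", "%3D"), ("&", "%26"), (",", "%2C")):
        value = value.replace(char, code)
    return value


def strict_attributes(column9, ec_names):
    """Rename the product from a definitive EC, otherwise drop the EC_number."""
    fields = [f for f in column9.strip().split(";") if f]
    attrs = dict(f.split("=", 1) for f in fields if "=" in f)
    ec = attrs.get("EC_number")
    if ec is None:
        return column9  # nothing to do for features without an EC number
    definitive = FOUR_LEVEL_EC.match(ec) is not None
    name = ec_names.get(ec) if definitive else None
    hypothetical = attrs.get("product", "").lower() in HYPOTHETICAL
    # Drop the EC if it is partial, or if the protein would stay hypothetical.
    drop_ec = not definitive or (hypothetical and name is None)
    kept = []
    for field in fields:
        key = field.split("=", 1)[0]
        if key == "EC_number" and drop_ec:
            continue
        if key == "product" and name is not None:
            field = "product=" + gff3_escape(name)
        kept.append(field)
    return ";".join(kept) + ";"


# Illustrative attribute column; the EC name is the one quoted earlier in the thread.
ec_names = {"1.3.8.6": "glutaryl-CoA dehydrogenase (ETF)"}
print(strict_attributes("ID=x-T1;Parent=x;product=hypothetical protein;EC_number=1.3.8.6;", ec_names))
```

Note that renaming from the EC trusts the EC number over the existing product, which is exactly the assumption questioned earlier in this thread, so it may be safer to restrict the rename to hypothetical proteins.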

Kind regards,

Dr Fredrick Mobegi, Bioinformatician (Centre for Crop and Disease Management)


fmobegi commented 2 years ago

Hey Jon, Just a quick update on this issue. After fixing the product names, we came across another related issue.

The placing of product="abcd"

Under which feature type in column 3 (gene, mRNA, *UTR, exon, CDS) should the product description be placed? At the moment, funannotate places this information on the mRNA feature. Should it also be added to the CDS features?

##gff-version 3
ArME14_ctg_01   funannotate     gene    28584   30375   .       -       .       ID=EKO05_000001;
ArME14_ctg_01   funannotate     mRNA    28584   30375   .       -       .       ID=EKO05_000001-T1;Parent=EKO05_000001;product=hypothetical protein;Ontology_term=GO:0004672,GO:0005524,GO:0006468;Dbxref=InterPro:IPR000719,InterPro:IPR017441,PFAM:PF00069;EC_number=2.7.11.1;note=EggNog:ENOG503P0IR,COG:T;
ArME14_ctg_01   funannotate     five_prime_UTR  30094   30375   .       -       .       ID=EKO05_000001-T1.utr5p1;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     exon    29885   30375   .       -       .       ID=EKO05_000001-T1.exon1;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     exon    29329   29831   .       -       .       ID=EKO05_000001-T1.exon2;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     exon    28584   29279   .       -       .       ID=EKO05_000001-T1.exon3;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     three_prime_UTR 28584   28659   .       -       .       ID=EKO05_000001-T1.utr3p1;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     CDS     29885   30093   .       -       0       ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     CDS     29329   29831   .       -       1       ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     CDS     28660   29279   .       -       2       ID=EKO05_000001-T1.cds;Parent=EKO05_000001-T1;
ArME14_ctg_01   funannotate     gene    73133   74748   .       -       .       ID=EKO05_000002;
ArME14_ctg_01   funannotate     mRNA    73133   74748   .       -       .       ID=EKO05_000002-T1;Parent=EKO05_000002;product=hypothetical protein;Ontology_term=GO:0055114,GO:0016491;Dbxref=PFAM:PF01266,InterPro:IPR006076;EC_number=1.5.3;note=EggNog:ENOG503NUZZ,COG:E;
ArME14_ctg_01   funannotate     five_prime_UTR  74585   74748   .       -       .       ID=EKO05_000002-T1.utr5p1;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     exon    73458   74748   .       -       .       ID=EKO05_000002-T1.exon1;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     exon    73133   73395   .       -       .       ID=EKO05_000002-T1.exon2;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     three_prime_UTR 73133   73202   .       -       .       ID=EKO05_000002-T1.utr3p1;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     CDS     73458   74584   .       -       0       ID=EKO05_000002-T1.cds;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     CDS     73203   73395   .       -       1       ID=EKO05_000002-T1.cds;Parent=EKO05_000002-T1;
ArME14_ctg_01   funannotate     gene    76053   77838   .       +       .       ID=EKO05_000003;

The annotation SQN file arising from such a GFF normally triggers PROTEIN_NAMES: All proteins have the same name "hypothetical protein", meaning the names are not transferred to the NCBI submission file.

JWDebler commented 1 year ago

Hi, just wanted to check in if any decision has been made regarding EC numbers and 'hypothetical protein' products. Currently the output files still have both in them and NCBI will complain about that.

zacksaud commented 5 months ago

Hi,

Sorry to prompt again, but has there been any update on this issue? The NCBI have just rejected a submission of mine with the following message: "[2] There are a lot of hypothetical proteins that have EC numbers. We expect that a protein characterized enough to be given an EC number should have a product name. Please change the name or remove the EC number" FATAL! 859 protein features have an EC number and a protein name of 'unknown protein' or 'hypothetical protein'

Fredrick, thank you for the script, but how does one go about using it? If it's on the GFF file, do I use it and then re-run funannotate annotate directing the --gff flag to the new gff3, or do I need to run table2asn?

Many thanks in advance

Best

Zack

JWDebler commented 5 months ago

@zacksaud In the meantime I wrote a script (mentioned above) that parses the funannotate GFF, looks for EC numbers, and pulls the corresponding enzyme name from the ExPASy database. I managed to get my annotations accepted by NCBI that way. I just updated the script, as some encoding has changed in the ExPASy database file. I just ran it over a recent funannotate GFF and it works.

https://github.com/JWDebler/bioinformatics/blob/master/parse_EC_number_after_funannotate.py

It needs the beautifulsoup4 package to run.

You just run the script using the funannotate gff as input:

python parse_EC_number_after_funannotate.py -i funannotate.gff > funannotate.fixed.gff

And then rerun table2asn with the fixed gff.

Cheers, Johannes

zacksaud commented 5 months ago

@JWDebler that's amazing, thank you!

JWDebler commented 5 months ago

@zacksaud no worries. It's not ideal, but it solves the problem you have. Preferably something similar will get rolled into funannotate in the future, maybe including better EC allocation; as mentioned above, EggNOG might not be the best source.