Closed by aldendirks 9 months ago
When I grep "GO" in the interproscan.xml file I see lots of things like this... maybe a "GO" entry without any following category information is a problem?
<go-xref db="GO" id="GO:0016020"/>
<go-xref db="GO" id="GO:0032991"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
<entry ac="IPR035445" desc="GYF-like domain superfamily" name="GYF-like_dom_sf" type="HOMOLOGOUS_SUPERFAMILY">
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
<entry ac="IPR035445" desc="GYF-like domain superfamily" name="GYF-like_dom_sf" type="HOMOLOGOUS_SUPERFAMILY">
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0016747" name="acyltransferase activity, transferring groups other than amino-acyl groups"/>
<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0016747" name="acyltransferase activity, transferring groups other than amino-acyl groups"/>
<entry ac="IPR016181" desc="Acyl-CoA N-acyltransferase" name="Acyl_CoA_acyltransferase" type="HOMOLOGOUS_SUPERFAMILY">
<go-xref db="GO" id="GO:0003700"/>
<go-xref db="GO" id="GO:0043226"/>
<go-xref db="GO" id="GO:0043565"/>
Here is the XML output for the first protein in an interproscan.xml file. Maybe the issue is all the lines of GO terms towards the bottom without any category information?
<?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.60-92.0">
<protein>
<sequence md5="9e4b4cc8d93c10ef376100d7ebfa07d0">MAPTKYTPLTLHFSDAVTNVYPRQVEKLVANDGSYEYFRALGENEQKDILWRSKIAKALVEKYLKNAKGDRLTETDTAKDYIFKTLPENYKLYEHVKGKRDEKSGGTISERRDTYLFGHPTGKRFRSPAEFVPHILHLAAQDDRPCECWICTGSKHGNPPTSVKKPTKRETEVTQARKVVALEERQREQETAGWVLRKGEVVWVWLSDNPEAEEASDDALIDGDGGLWVAGVVAERPSFTPPYQKVRKTTGNAFADIDMDDTPPTWQQEGGNVPEKTYIIQLCSDPPKLGQILKGVPQHHVKPWLSRQECAQAPPSYSGKIEHPSIPRARRVAETFSLFDRVSEPSDPPSASDPSPDAPKIANFQGVFLGAEKIYIHEPVRISSANEDEIEDVLVVDKIYTCTTTSESASSGSDGKKKTLTTTQFRGNVYTAYPSTTCTPLSSHQFTELPFRMRRGSGTGEIIKWFIRNVPEERGECSLKMILGRWYEPQAVNEWIGSTGFSGGLPSSKETAMCQKDVKRWVKNRADALGLVSVNGIDLKSEGEVKIQPGKLTSPLKPKPADATAEAMDVDEPPQVTPERGFKSVNLRISSVTPGSASSLKITPRTEADDAGIDGGDIEEEEQVEGDEDEEDEDDEATMSDDKYHQPGPEVLSRSPTKRLSK</sequence>
<xref id="FUN_001952-T1" name="FUN_001952-T1 FUN_001952"/>
<matches>
<hmmer3-match evalue="5.9E-22" score="78.3">
<signature ac="PF16761" desc="Transcription-silencing protein, cryptic loci regulator Clr2" name="Clr2_transil">
<entry ac="IPR031915" desc="Cryptic loci regulator 2, N-terminal" name="Clr2_N" type="DOMAIN"/>
<signature-library-release library="PFAM" version="35.0"/>
</signature>
<model-ac>PF16761</model-ac>
<locations>
<hmmer3-location env-end="151" env-start="81" post-processed="true" score="77.3" evalue="1.2E-21" hmm-start="1" hmm-end="68" hmm-length="68" hmm-bounds="COMPLETE" start="81" end="151">
<location-fragments>
<hmmer3-location-fragment start="81" end="151" dc-status="CONTINUOUS"/>
</location-fragments>
</hmmer3-location>
</locations>
</hmmer3-match>
<hmmer3-match evalue="5.1E-14" score="53.2">
<signature ac="PF10383" desc="Transcription-silencing protein Clr2" name="Clr2">
<entry ac="IPR018839" desc="Cryptic loci regulator 2, C-terminal" name="Tscrpt-silencing_Clr2_C" type="DOMAIN"/>
<signature-library-release library="PFAM" version="35.0"/>
</signature>
<model-ac>PF10383</model-ac>
<locations>
<hmmer3-location env-end="488" env-start="363" post-processed="true" score="51.1" evalue="2.2E-13" hmm-start="2" hmm-end="143" hmm-length="143" hmm-bounds="C_TERMINAL_COMPLETE" start="364" end="488">
<location-fragments>
<hmmer3-location-fragment start="364" end="488" dc-status="CONTINUOUS"/>
</location-fragments>
</hmmer3-location>
</locations>
</hmmer3-match>
<mobidblite-match>
<signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
<signature-library-release library="MOBIDB_LITE" version="2.0"/>
</signature>
<model-ac>mobidb-lite</model-ac>
<locations>
<mobidblite-location sequence-feature="" start="549" end="662">
<location-fragments>
<mobidblite-location-fragment start="549" end="662" dc-status="CONTINUOUS"/>
</location-fragments>
</mobidblite-location>
</locations>
</mobidblite-match>
<mobidblite-match>
<signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
<signature-library-release library="MOBIDB_LITE" version="2.0"/>
</signature>
<model-ac>mobidb-lite</model-ac>
<locations>
<mobidblite-location sequence-feature="Polar" start="584" end="603">
<location-fragments>
<mobidblite-location-fragment start="584" end="603" dc-status="CONTINUOUS"/>
</location-fragments>
</mobidblite-location>
</locations>
</mobidblite-match>
<mobidblite-match>
<signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
<signature-library-release library="MOBIDB_LITE" version="2.0"/>
</signature>
<model-ac>mobidb-lite</model-ac>
<locations>
<mobidblite-location sequence-feature="Negative Polyelectrolyte" start="613" end="638">
<location-fragments>
<mobidblite-location-fragment start="613" end="638" dc-status="CONTINUOUS"/>
</location-fragments>
</mobidblite-location>
</locations>
</mobidblite-match>
<panther-match ac="PTHR38046:SF1" evalue="5.9E-54" graft-point="PTN002866222" name="CRYPTIC LOCI REGULATOR 2" score="195.5">
<signature ac="PTHR38046" name="CRYPTIC LOCI REGULATOR 2">
<entry ac="IPR038986" desc="Cryptic loci regulator 2" name="Clr2" type="FAMILY">
<go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0031507" name="heterochromatin formation"/>
<go-xref category="CELLULAR_COMPONENT" db="GO" id="GO:0070824" name="SHREC complex"/>
</entry>
<signature-library-release library="PANTHER" version="17.0"/>
</signature>
<model-ac>PTHR38046:SF1</model-ac>
<locations>
<panther-location env-start="2" env-end="546" hmm-start="16" hmm-end="548" hmm-length="0" hmm-bounds="INCOMPLETE" start="4" end="494">
<location-fragments>
<panther-location-fragment start="4" end="494" dc-status="CONTINUOUS"/>
</location-fragments>
</panther-location>
</locations>
<go-xref db="GO" id="GO:0040029"/>
<go-xref db="GO" id="GO:0043226"/>
<go-xref db="GO" id="GO:0006996"/>
<go-xref db="GO" id="GO:0009987"/>
<go-xref db="GO" id="GO:0043229"/>
<go-xref db="GO" id="GO:0043170"/>
<go-xref db="GO" id="GO:0019538"/>
<go-xref db="GO" id="GO:0000792"/>
<go-xref db="GO" id="GO:0098732"/>
<go-xref db="GO" id="GO:0009892"/>
<go-xref db="GO" id="GO:0016570"/>
<go-xref db="GO" id="GO:0010467"/>
<go-xref db="GO" id="GO:0006464"/>
<go-xref db="GO" id="GO:1901564"/>
<go-xref db="GO" id="GO:0065007"/>
<go-xref db="GO" id="GO:0045814"/>
<go-xref db="GO" id="GO:0071840"/>
<go-xref db="GO" id="GO:0110165"/>
<go-xref db="GO" id="GO:0008152"/>
<go-xref db="GO" id="GO:0006325"/>
<go-xref db="GO" id="GO:0044238"/>
<go-xref db="GO" id="GO:0070828"/>
<go-xref db="GO" id="GO:0043412"/>
<go-xref db="GO" id="GO:0050789"/>
<go-xref db="GO" id="GO:0048519"/>
<go-xref db="GO" id="GO:0044237"/>
<go-xref db="GO" id="GO:0019222"/>
<go-xref db="GO" id="GO:0005622"/>
<go-xref db="GO" id="GO:0006807"/>
<go-xref db="GO" id="GO:0043232"/>
<go-xref db="GO" id="GO:0016043"/>
<go-xref db="GO" id="GO:0010629"/>
<go-xref db="GO" id="GO:0071103"/>
<go-xref db="GO" id="GO:0051276"/>
<go-xref db="GO" id="GO:0071704"/>
<go-xref db="GO" id="GO:0005694"/>
<go-xref db="GO" id="GO:0031507"/>
<go-xref db="GO" id="GO:0060255"/>
<go-xref db="GO" id="GO:0006323"/>
<go-xref db="GO" id="GO:0044260"/>
<go-xref db="GO" id="GO:0010468"/>
<go-xref db="GO" id="GO:0036211"/>
<go-xref db="GO" id="GO:0006476"/>
<go-xref db="GO" id="GO:0016575"/>
<go-xref db="GO" id="GO:0035601"/>
<go-xref db="GO" id="GO:0000785"/>
<go-xref db="GO" id="GO:0022607"/>
<go-xref db="GO" id="GO:0031497"/>
<go-xref db="GO" id="GO:0006333"/>
<go-xref db="GO" id="GO:0044267"/>
<go-xref db="GO" id="GO:0044085"/>
<go-xref db="GO" id="GO:0043228"/>
<go-xref db="GO" id="GO:0010605"/>
</panther-match>
</matches>
</protein>
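One way to test the hypothesis about GO entries without category information is to count the go-xref elements by their category attribute. The following is only a sketch: the function name count_go_xrefs and the in-memory sample are made up for illustration, and it assumes the file parses as well-formed XML with the namespace shown in the header above.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_go_xrefs(xml_source):
    """Tally <go-xref> elements by 'category' attribute (key None when absent).

    xml_source may be a file path or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        # Tags carry the InterProScan namespace, e.g. '{http://...}go-xref',
        # so compare only the local part of the name.
        if elem.tag.rsplit("}", 1)[-1] == "go-xref":
            counts[elem.get("category")] += 1
    return counts


# Tiny in-memory sample modelled on the lines quoted in this thread.
sample = (
    b'<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">'
    b'<go-xref db="GO" id="GO:0016020"/>'
    b'<go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>'
    b'<go-xref db="GO" id="GO:0032991"/>'
    b'</protein-matches>'
)
counts = count_go_xrefs(io.BytesIO(sample))
print(counts[None], "go-xref elements without a category")
print(counts["MOLECULAR_FUNCTION"], "MOLECULAR_FUNCTION go-xref elements")
```

Calling count_go_xrefs("interproscan.xml") on the real file should show how many entries fall under the None key, which matches the "None" in the error message.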
Not sure exactly what is wrong. I added some additional checks to that script; can you try upgrading to the latest version and see if that fixes it?
python -m pip install git+https://github.com/nextgenusfs/funannotate.git --upgrade --force --no-deps
It worked!! I think...
This is the output... is that what you would expect from a successful annotation? There wasn't any message about the number of annotations after "Parsing InterProScan5 XML file" like it reports for the other lines of evidence.
[Sep 06 04:24 PM]: OS: Red Hat Enterprise Linux 8.6, 36 cores, ~ 196 GB RAM. Python: 3.8.15
[Sep 06 04:24 PM]: Running 1.8.13
[Sep 06 04:24 PM]: Found existing output directory fun_out. Warning, will re-use any intermediate files found.
[Sep 06 04:24 PM]: Parsing input files
[Sep 06 04:24 PM]: Existing tbl found: fun_out/predict_results/Gyromitra_korfii_ACD0399.tbl
[Sep 06 04:24 PM]: Adding Functional Annotation to Gyromitra korfii, NCBI accession: None
[Sep 06 04:24 PM]: Annotation consists of: 10,893 gene models
[Sep 06 04:24 PM]: 10,671 protein records loaded
[Sep 06 04:24 PM]: Existing Pfam-A results found: fun_out/annotate_misc/annotations.pfam.txt
[Sep 06 04:24 PM]: 10,765 annotations added
[Sep 06 04:24 PM]: Running Diamond blastp search of UniProt DB version 2022_05
[Sep 06 04:24 PM]: 628 valid gene/product annotations from 863 total
[Sep 06 04:24 PM]: Existing Eggnog-mapper results found: fun_out/annotate_misc/eggnog.emapper.annotations
[Sep 06 04:24 PM]: Parsing EggNog Annotations
[Sep 06 04:24 PM]: EggNog version parsed as 2.1.9
[Sep 06 04:24 PM]: 16,624 COG and EggNog annotations added
[Sep 06 04:24 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.85
[Sep 06 04:24 PM]: 2,748 gene name and product description annotations added
[Sep 06 04:24 PM]: Existing MEROPS results found: fun_out/annotate_misc/annotations.merops.txt
[Sep 06 04:24 PM]: 224 annotations added
[Sep 06 04:24 PM]: Existing CAZYme results found: fun_out/annotate_misc/annotations.dbCAN.txt
[Sep 06 04:24 PM]: 266 annotations added
[Sep 06 04:24 PM]: Existing BUSCO2 results found: fun_out/annotate_misc/annotations.busco.txt
[Sep 06 04:24 PM]: 1,223 annotations added
[Sep 06 04:24 PM]: Existing Phobius results found: fun_out/annotate_misc/phobius.results.txt
[Sep 06 04:24 PM]: Existing SignalP results found: fun_out/annotate_misc/signalp.results.txt
[Sep 06 04:24 PM]: 672 secretome and 1,696 transmembane annotations added
[Sep 06 04:24 PM]: Parsing InterProScan5 XML file
[Sep 06 04:26 PM]: Now parsing antiSMASH v6 results, finding SM clusters
[Sep 06 04:26 PM]: Found 12 clusters, 0 biosynthetic enyzmes, and 0 smCOGs predicted by antiSMASH
[Sep 06 04:27 PM]: Found 0 duplicated annotations, adding 229,389 valid annotations
[Sep 06 04:27 PM]: Converting to final Genbank format, good luck!
[Sep 06 04:30 PM]: Creating AGP file and corresponding contigs file
[Sep 06 04:30 PM]: Cross referencing SM cluster hits with MIBiG database version 1.4
[Sep 06 04:30 PM]: Creating tab-delimited SM cluster output
[Sep 06 04:30 PM]: Writing genome annotation table.
[Sep 06 04:30 PM]: Funannotate annotate has completed successfully!
Yeah, looks correct. You can check the iprscan annotations by looking at the 3 column TSV file in annotate_misc/annotations.iprscan.txt, maybe just a cursory glance to see if it picked up the GO annotations. They will also show up in the final annotation table TSV file (in the annotate_results folder).
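For something slightly more systematic than a cursory glance, the rows of that 3 column TSV can be tallied by the value in the middle column. This is a rough sketch: the helper name tally_annotation_types is invented here, and the sample rows (including the labels in the second column) are assumptions for illustration, not copied from a real annotations.iprscan.txt.

```python
from collections import Counter


def tally_annotation_types(lines):
    """Count rows per value of the second column in a 3-column, tab-separated input.

    `lines` is any iterable of strings, for example an open file handle.
    Rows that do not have exactly three fields are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            counts[fields[1]] += 1
    return counts


# Hypothetical rows; the real file's labels and values may differ.
example_rows = [
    "FUN_001952-T1\tdb_xref\tInterPro:IPR038986\n",
    "FUN_001952-T1\tgo_process\tGO:0031507\n",
    "FUN_001952-T1\tgo_component\tGO:0070824\n",
    "a malformed line without tabs\n",
]
tally = tally_annotation_types(example_rows)
for label, n in sorted(tally.items()):
    print(label, n)

# On a real run, something like:
# with open("fun_out/annotate_misc/annotations.iprscan.txt") as handle:
#     print(tally_annotation_types(handle))
```

If GO annotations were picked up, some of the labels in the tally should be GO-related.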
All looks good, thanks so much for your help and writing this pipeline!
When funannotate annotate attempts to parse the interproscan.xml file it throws the error: Error parsing XML GO terms: None is not a valid term
I checked the interproscan.xml file and it seems OK. Looking at iprscan2annotations.py I noticed that it needs GO attributes of BIOLOGICAL_PROCESS, MOLECULAR_FUNCTION, or CELLULAR_COMPONENT, else it fails. I couldn't find any other kind of GO attribute, and I'm not sure why it is saying "None". I tried commenting out sys.exit(1) (leaving the print function) to see if it would work anyway but then the error I got was: line 27, in convertGOattribute return attribute UnboundLocalError: local variable 'attribute' referenced before assignment
Any help would be much appreciated... so close to annotating a genome!
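The UnboundLocalError is what Python raises when a function reaches return attribute without any branch having assigned attribute. That is presumably what happens once sys.exit(1) is commented out and the category is None. A minimal sketch of a more tolerant converter follows. It is not the actual funannotate code or the maintainer's fix: the names convert_go_attribute and labelled_go_terms and the label strings are assumptions, and skipping uncategorised terms is just one possible policy.

```python
# Assumed mapping from InterProScan GO categories to annotation labels.
# The label strings are illustrative and not taken from funannotate.
GO_CATEGORY_LABELS = {
    "BIOLOGICAL_PROCESS": "go_process",
    "MOLECULAR_FUNCTION": "go_function",
    "CELLULAR_COMPONENT": "go_component",
}


def convert_go_attribute(category):
    """Return the label for a GO category, or None if it is missing or unknown.

    dict.get always returns a value, so there is no code path where the
    result is left unassigned.
    """
    return GO_CATEGORY_LABELS.get(category)


def labelled_go_terms(xrefs):
    """Keep only the (category, go_id) pairs whose category is recognised."""
    kept = []
    for category, go_id in xrefs:
        label = convert_go_attribute(category)
        if label is None:
            # e.g. <go-xref db="GO" id="GO:0040029"/> has no category attribute
            continue
        kept.append((label, go_id))
    return kept


xrefs = [
    ("BIOLOGICAL_PROCESS", "GO:0031507"),
    (None, "GO:0040029"),
    ("CELLULAR_COMPONENT", "GO:0070824"),
]
print(labelled_go_terms(xrefs))
```

With this policy the categorised terms are kept and go-xref elements lacking a category are ignored instead of aborting the run.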