nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Error parsing XML GO terms: None is not a valid term #959

Closed: aldendirks closed this issue 9 months ago

aldendirks commented 9 months ago

When funannotate annotate attempts to parse the interproscan.xml file it throws the error:

Error parsing XML GO terms: None is not a valid term

I checked the interproscan.xml file and it seems OK. Looking at iprscan2annotations.py, I noticed that it expects each GO term to have a category of BIOLOGICAL_PROCESS, MOLECULAR_FUNCTION, or CELLULAR_COMPONENT; otherwise it fails. I couldn't find any other kind of GO category in the file, and I'm not sure why it is reporting "None". I tried commenting out sys.exit(1) (leaving the print function) to see if it would work anyway, but then I got this error:

  line 27, in convertGOattribute
    return attribute
UnboundLocalError: local variable 'attribute' referenced before assignment

Any help would be much appreciated... so close to annotating a genome!
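
For readers following along, here is a minimal sketch of why commenting out sys.exit(1) leads to the UnboundLocalError. It assumes convertGOattribute binds `attribute` only inside branches that match one of the three category names. The function bodies and the qualifier strings (`go_process`, `go_function`, `go_component`) are hypothetical and are not copied from iprscan2annotations.py.

```python
from typing import Optional

# Hypothetical mapping; the names used by the real script may differ.
GO_CATEGORY_TO_QUALIFIER = {
    "BIOLOGICAL_PROCESS": "go_process",
    "MOLECULAR_FUNCTION": "go_function",
    "CELLULAR_COMPONENT": "go_component",
}


def convert_fragile(category):
    """Mimics the failure mode: no branch matches None, so `attribute`
    is never bound and the return statement raises UnboundLocalError."""
    if category == "BIOLOGICAL_PROCESS":
        attribute = "go_process"
    elif category == "MOLECULAR_FUNCTION":
        attribute = "go_function"
    elif category == "CELLULAR_COMPONENT":
        attribute = "go_component"
    return attribute


def convert_tolerant(category: Optional[str]) -> Optional[str]:
    """Returns None for a missing or unknown category so the caller
    can decide to skip that go-xref instead of crashing."""
    return GO_CATEGORY_TO_QUALIFIER.get(category)


if __name__ == "__main__":
    print(convert_tolerant("MOLECULAR_FUNCTION"))
    print(convert_tolerant(None))
    try:
        convert_fragile(None)
    except UnboundLocalError as err:
        print("fragile version failed:", err)
```

A lookup that returns None for an unknown category lets the caller skip go-xref elements that carry no category attribute. Whether the upstream fix works this way is not stated in this thread.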

aldendirks commented 9 months ago

When I grep "GO" in the interproscan.xml file I see lots of things like this... maybe a "GO" entry without any following category information is a problem?

        <go-xref db="GO" id="GO:0016020"/>
        <go-xref db="GO" id="GO:0032991"/>
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
          <entry ac="IPR035445" desc="GYF-like domain superfamily" name="GYF-like_dom_sf" type="HOMOLOGOUS_SUPERFAMILY">
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
          <entry ac="IPR035445" desc="GYF-like domain superfamily" name="GYF-like_dom_sf" type="HOMOLOGOUS_SUPERFAMILY">
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0016747" name="acyltransferase activity, transferring groups other than amino-acyl groups"/>
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0016747" name="acyltransferase activity, transferring groups other than amino-acyl groups"/>
          <entry ac="IPR016181" desc="Acyl-CoA N-acyltransferase" name="Acyl_CoA_acyltransferase" type="HOMOLOGOUS_SUPERFAMILY">
        <go-xref db="GO" id="GO:0003700"/>
        <go-xref db="GO" id="GO:0043226"/>
        <go-xref db="GO" id="GO:0043565"/>

aldendirks commented 9 months ago

Here is the XML output for the first protein in an interproscan.xml file. Maybe the issue is all the lines of GO terms towards the bottom without any category information?

<?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.60-92.0">
  <protein>
    <sequence md5="9e4b4cc8d93c10ef376100d7ebfa07d0">MAPTKYTPLTLHFSDAVTNVYPRQVEKLVANDGSYEYFRALGENEQKDILWRSKIAKALVEKYLKNAKGDRLTETDTAKDYIFKTLPENYKLYEHVKGKRDEKSGGTISERRDTYLFGHPTGKRFRSPAEFVPHILHLAAQDDRPCECWICTGSKHGNPPTSVKKPTKRETEVTQARKVVALEERQREQETAGWVLRKGEVVWVWLSDNPEAEEASDDALIDGDGGLWVAGVVAERPSFTPPYQKVRKTTGNAFADIDMDDTPPTWQQEGGNVPEKTYIIQLCSDPPKLGQILKGVPQHHVKPWLSRQECAQAPPSYSGKIEHPSIPRARRVAETFSLFDRVSEPSDPPSASDPSPDAPKIANFQGVFLGAEKIYIHEPVRISSANEDEIEDVLVVDKIYTCTTTSESASSGSDGKKKTLTTTQFRGNVYTAYPSTTCTPLSSHQFTELPFRMRRGSGTGEIIKWFIRNVPEERGECSLKMILGRWYEPQAVNEWIGSTGFSGGLPSSKETAMCQKDVKRWVKNRADALGLVSVNGIDLKSEGEVKIQPGKLTSPLKPKPADATAEAMDVDEPPQVTPERGFKSVNLRISSVTPGSASSLKITPRTEADDAGIDGGDIEEEEQVEGDEDEEDEDDEATMSDDKYHQPGPEVLSRSPTKRLSK</sequence>
    <xref id="FUN_001952-T1" name="FUN_001952-T1 FUN_001952"/>
    <matches>
      <hmmer3-match evalue="5.9E-22" score="78.3">
        <signature ac="PF16761" desc="Transcription-silencing protein, cryptic loci regulator Clr2" name="Clr2_transil">
          <entry ac="IPR031915" desc="Cryptic loci regulator 2, N-terminal" name="Clr2_N" type="DOMAIN"/>
          <signature-library-release library="PFAM" version="35.0"/>
        </signature>
        <model-ac>PF16761</model-ac>
        <locations>
          <hmmer3-location env-end="151" env-start="81" post-processed="true" score="77.3" evalue="1.2E-21" hmm-start="1" hmm-end="68" hmm-length="68" hmm-bounds="COMPLETE" start="81" end="151">
            <location-fragments>
              <hmmer3-location-fragment start="81" end="151" dc-status="CONTINUOUS"/>
            </location-fragments>
          </hmmer3-location>
        </locations>
      </hmmer3-match>
      <hmmer3-match evalue="5.1E-14" score="53.2">
        <signature ac="PF10383" desc="Transcription-silencing protein Clr2" name="Clr2">
          <entry ac="IPR018839" desc="Cryptic loci regulator 2, C-terminal" name="Tscrpt-silencing_Clr2_C" type="DOMAIN"/>
          <signature-library-release library="PFAM" version="35.0"/>
        </signature>
        <model-ac>PF10383</model-ac>
        <locations>
          <hmmer3-location env-end="488" env-start="363" post-processed="true" score="51.1" evalue="2.2E-13" hmm-start="2" hmm-end="143" hmm-length="143" hmm-bounds="C_TERMINAL_COMPLETE" start="364" end="488">
            <location-fragments>
              <hmmer3-location-fragment start="364" end="488" dc-status="CONTINUOUS"/>
            </location-fragments>
          </hmmer3-location>
        </locations>
      </hmmer3-match>
      <mobidblite-match>
        <signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
          <signature-library-release library="MOBIDB_LITE" version="2.0"/>
        </signature>
        <model-ac>mobidb-lite</model-ac>
        <locations>
          <mobidblite-location sequence-feature="" start="549" end="662">
            <location-fragments>
              <mobidblite-location-fragment start="549" end="662" dc-status="CONTINUOUS"/>
            </location-fragments>
          </mobidblite-location>
        </locations>
      </mobidblite-match>
      <mobidblite-match>
        <signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
          <signature-library-release library="MOBIDB_LITE" version="2.0"/>
        </signature>
        <model-ac>mobidb-lite</model-ac>
        <locations>
          <mobidblite-location sequence-feature="Polar" start="584" end="603">
            <location-fragments>
              <mobidblite-location-fragment start="584" end="603" dc-status="CONTINUOUS"/>
            </location-fragments>
          </mobidblite-location>
        </locations>
      </mobidblite-match>
      <mobidblite-match>
        <signature ac="mobidb-lite" desc="consensus disorder prediction" name="disorder_prediction">
          <signature-library-release library="MOBIDB_LITE" version="2.0"/>
        </signature>
        <model-ac>mobidb-lite</model-ac>
        <locations>
          <mobidblite-location sequence-feature="Negative Polyelectrolyte" start="613" end="638">
            <location-fragments>
              <mobidblite-location-fragment start="613" end="638" dc-status="CONTINUOUS"/>
            </location-fragments>
          </mobidblite-location>
        </locations>
      </mobidblite-match>
      <panther-match ac="PTHR38046:SF1" evalue="5.9E-54" graft-point="PTN002866222" name="CRYPTIC LOCI REGULATOR 2" score="195.5">
        <signature ac="PTHR38046" name="CRYPTIC LOCI REGULATOR 2">
          <entry ac="IPR038986" desc="Cryptic loci regulator 2" name="Clr2" type="FAMILY">
            <go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0031507" name="heterochromatin formation"/>
            <go-xref category="CELLULAR_COMPONENT" db="GO" id="GO:0070824" name="SHREC complex"/>
          </entry>
          <signature-library-release library="PANTHER" version="17.0"/>
        </signature>
        <model-ac>PTHR38046:SF1</model-ac>
        <locations>
          <panther-location env-start="2" env-end="546" hmm-start="16" hmm-end="548" hmm-length="0" hmm-bounds="INCOMPLETE" start="4" end="494">
            <location-fragments>
              <panther-location-fragment start="4" end="494" dc-status="CONTINUOUS"/>
            </location-fragments>
          </panther-location>
        </locations>
        <go-xref db="GO" id="GO:0040029"/>
        <go-xref db="GO" id="GO:0043226"/>
        <go-xref db="GO" id="GO:0006996"/>
        <go-xref db="GO" id="GO:0009987"/>
        <go-xref db="GO" id="GO:0043229"/>
        <go-xref db="GO" id="GO:0043170"/>
        <go-xref db="GO" id="GO:0019538"/>
        <go-xref db="GO" id="GO:0000792"/>
        <go-xref db="GO" id="GO:0098732"/>
        <go-xref db="GO" id="GO:0009892"/>
        <go-xref db="GO" id="GO:0016570"/>
        <go-xref db="GO" id="GO:0010467"/>
        <go-xref db="GO" id="GO:0006464"/>
        <go-xref db="GO" id="GO:1901564"/>
        <go-xref db="GO" id="GO:0065007"/>
        <go-xref db="GO" id="GO:0045814"/>
        <go-xref db="GO" id="GO:0071840"/>
        <go-xref db="GO" id="GO:0110165"/>
        <go-xref db="GO" id="GO:0008152"/>
        <go-xref db="GO" id="GO:0006325"/>
        <go-xref db="GO" id="GO:0044238"/>
        <go-xref db="GO" id="GO:0070828"/>
        <go-xref db="GO" id="GO:0043412"/>
        <go-xref db="GO" id="GO:0050789"/>
        <go-xref db="GO" id="GO:0048519"/>
        <go-xref db="GO" id="GO:0044237"/>
        <go-xref db="GO" id="GO:0019222"/>
        <go-xref db="GO" id="GO:0005622"/>
        <go-xref db="GO" id="GO:0006807"/>
        <go-xref db="GO" id="GO:0043232"/>
        <go-xref db="GO" id="GO:0016043"/>
        <go-xref db="GO" id="GO:0010629"/>
        <go-xref db="GO" id="GO:0071103"/>
        <go-xref db="GO" id="GO:0051276"/>
        <go-xref db="GO" id="GO:0071704"/>
        <go-xref db="GO" id="GO:0005694"/>
        <go-xref db="GO" id="GO:0031507"/>
        <go-xref db="GO" id="GO:0060255"/>
        <go-xref db="GO" id="GO:0006323"/>
        <go-xref db="GO" id="GO:0044260"/>
        <go-xref db="GO" id="GO:0010468"/>
        <go-xref db="GO" id="GO:0036211"/>
        <go-xref db="GO" id="GO:0006476"/>
        <go-xref db="GO" id="GO:0016575"/>
        <go-xref db="GO" id="GO:0035601"/>
        <go-xref db="GO" id="GO:0000785"/>
        <go-xref db="GO" id="GO:0022607"/>
        <go-xref db="GO" id="GO:0031497"/>
        <go-xref db="GO" id="GO:0006333"/>
        <go-xref db="GO" id="GO:0044267"/>
        <go-xref db="GO" id="GO:0044085"/>
        <go-xref db="GO" id="GO:0043228"/>
        <go-xref db="GO" id="GO:0010605"/>
      </panther-match>
    </matches>
  </protein>
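
One way to test the hypothesis above is to count go-xref elements by their category attribute. The sketch below uses Python's standard xml.etree.ElementTree and the namespace declared on the protein-matches root element. The function name and the trimmed SAMPLE document are illustrative assumptions, not part of funannotate.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute on <protein-matches>.
NS = "{http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5}"

# Trimmed-down sample modeled on the record above (illustration only).
SAMPLE = """<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">
  <protein>
    <matches>
      <panther-match ac="PTHR38046:SF1">
        <signature ac="PTHR38046">
          <entry ac="IPR038986" type="FAMILY">
            <go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0031507" name="heterochromatin formation"/>
            <go-xref category="CELLULAR_COMPONENT" db="GO" id="GO:0070824" name="SHREC complex"/>
          </entry>
        </signature>
        <go-xref db="GO" id="GO:0040029"/>
        <go-xref db="GO" id="GO:0043226"/>
      </panther-match>
    </matches>
  </protein>
</protein-matches>"""


def count_go_xref_categories(root: ET.Element) -> Counter:
    """Count go-xref elements per category; a missing attribute is keyed as None."""
    counts = Counter()
    for elem in root.iter(NS + "go-xref"):
        counts[elem.get("category")] += 1
    return counts


if __name__ == "__main__":
    counts = count_go_xref_categories(ET.fromstring(SAMPLE))
    for category, n in counts.items():
        print(category, n)
```

For a full file, `ET.parse("interproscan.xml").getroot()` can be passed to the same function. A non-zero count under the None key would be consistent with the "None is not a valid term" message.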

nextgenusfs commented 9 months ago

I'm not sure exactly what is wrong. I had some additional checks on that script; can you try upgrading to the latest version and see if that fixes it?

python -m pip install git+https://github.com/nextgenusfs/funannotate.git --upgrade --force --no-deps

aldendirks commented 9 months ago

It worked!! I think...

This is the output. Is this what you would expect from a successful annotation? After "Parsing InterProScan5 XML file" there was no message about the number of annotations added, unlike for the other lines of evidence.

[Sep 06 04:24 PM]: OS: Red Hat Enterprise Linux 8.6, 36 cores, ~ 196 GB RAM. Python: 3.8.15
[Sep 06 04:24 PM]: Running 1.8.13
[Sep 06 04:24 PM]: Found existing output directory fun_out. Warning, will re-use any intermediate files found.
[Sep 06 04:24 PM]: Parsing input files
[Sep 06 04:24 PM]: Existing tbl found: fun_out/predict_results/Gyromitra_korfii_ACD0399.tbl
[Sep 06 04:24 PM]: Adding Functional Annotation to Gyromitra korfii, NCBI accession: None
[Sep 06 04:24 PM]: Annotation consists of: 10,893 gene models
[Sep 06 04:24 PM]: 10,671 protein records loaded
[Sep 06 04:24 PM]: Existing Pfam-A results found: fun_out/annotate_misc/annotations.pfam.txt
[Sep 06 04:24 PM]: 10,765 annotations added
[Sep 06 04:24 PM]: Running Diamond blastp search of UniProt DB version 2022_05
[Sep 06 04:24 PM]: 628 valid gene/product annotations from 863 total
[Sep 06 04:24 PM]: Existing Eggnog-mapper results found: fun_out/annotate_misc/eggnog.emapper.annotations
[Sep 06 04:24 PM]: Parsing EggNog Annotations
[Sep 06 04:24 PM]: EggNog version parsed as 2.1.9
[Sep 06 04:24 PM]: 16,624 COG and EggNog annotations added
[Sep 06 04:24 PM]: Combining UniProt/EggNog gene and product names using Gene2Product version 1.85
[Sep 06 04:24 PM]: 2,748 gene name and product description annotations added
[Sep 06 04:24 PM]: Existing MEROPS results found: fun_out/annotate_misc/annotations.merops.txt
[Sep 06 04:24 PM]: 224 annotations added
[Sep 06 04:24 PM]: Existing CAZYme results found: fun_out/annotate_misc/annotations.dbCAN.txt
[Sep 06 04:24 PM]: 266 annotations added
[Sep 06 04:24 PM]: Existing BUSCO2 results found: fun_out/annotate_misc/annotations.busco.txt
[Sep 06 04:24 PM]: 1,223 annotations added
[Sep 06 04:24 PM]: Existing Phobius results found: fun_out/annotate_misc/phobius.results.txt
[Sep 06 04:24 PM]: Existing SignalP results found: fun_out/annotate_misc/signalp.results.txt
[Sep 06 04:24 PM]: 672 secretome and 1,696 transmembane annotations added
[Sep 06 04:24 PM]: Parsing InterProScan5 XML file
[Sep 06 04:26 PM]: Now parsing antiSMASH v6 results, finding SM clusters
[Sep 06 04:26 PM]: Found 12 clusters, 0 biosynthetic enyzmes, and 0 smCOGs predicted by antiSMASH
[Sep 06 04:27 PM]: Found 0 duplicated annotations, adding 229,389 valid annotations
[Sep 06 04:27 PM]: Converting to final Genbank format, good luck!
[Sep 06 04:30 PM]: Creating AGP file and corresponding contigs file
[Sep 06 04:30 PM]: Cross referencing SM cluster hits with MIBiG database version 1.4
[Sep 06 04:30 PM]: Creating tab-delimited SM cluster output
[Sep 06 04:30 PM]: Writing genome annotation table.
[Sep 06 04:30 PM]: Funannotate annotate has completed successfully!

nextgenusfs commented 9 months ago

Yeah, that looks correct. You can check the iprscan annotations by looking at the 3-column TSV file annotate_misc/annotations.iprscan.txt; a cursory glance should show whether it picked up the GO annotations. They will also show up in the final annotation table TSV file in the annotate_results folder.
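
The cursory check could also be scripted. The sketch below assumes only what is stated above, that the file is tab-separated with three columns, and counts rows in which any field mentions a GO identifier. The function name and the sample rows are made up for illustration; the real column contents may look different.

```python
import csv
import io
from collections import Counter


def summarize_iprscan_annotations(handle) -> Counter:
    """Count well-formed 3-column rows and how many of them mention a GO ID."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            counts["malformed"] += 1
            continue
        counts["rows"] += 1
        if any("GO:" in field for field in row):
            counts["with_go"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical rows; replace with the real file, for example:
    #   with open("fun_out/annotate_misc/annotations.iprscan.txt", newline="") as fh:
    #       print(summarize_iprscan_annotations(fh))
    sample = io.StringIO(
        "FUN_001952-T1\tdb_xref\tInterPro:IPR038986\n"
        "FUN_001952-T1\tgo_process\tGO:0031507\n"
    )
    print(summarize_iprscan_annotations(sample))
```

A `with_go` count of zero on the real file would suggest that the GO terms were not picked up.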

aldendirks commented 9 months ago

All looks good, thanks so much for your help and for writing this pipeline!