tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

Error loading gaf file #195

Closed JanZrimec closed 3 years ago

JanZrimec commented 3 years ago

Hey, I cannot load a GAF file from the Saccharomyces Genome Database (ver 2.0, created 2018) with GafReader in a Jupyter notebook. It seems the file is in a different format than what the reader expects, with at least one missing column. Are there any settings or workarounds that would enable loading this file? Thanks!

Code:

import wget
wget.download('http://downloads.yeastgenome.org/curation/literature/gene_association.sgd.gaf.gz')
!gunzip gene_association.sgd.gaf.gz
from goatools.anno.gaf_reader import GafReader
objanno_sc = GafReader('gene_association.sgd.gaf')

Error message: BAD Extension( )

0) REQ DB SGD
1) REQ DB_ID S000001503
2) REQ DB_Symbol SPT23
3) Qualifier
4) REQ GO_ID GO:0003674
5) REQ DB_Reference GO_REF:0000015
6) REQ Evidence_Code ND
7) With_From
8) REQ NS F
9) DB_Name ER membrane protein involved in regulation of OLE1 transcription
10) DB_Synonym YKL020C
11) REQ DB_Type protein
12) REQ Taxon taxon:559292
13) REQ Date 20181102
14) REQ Assigned_By SGD
15) Extension

Traceback (most recent call last):
  File "/home/zrimec/miniconda3/envs/py36/lib/python3.6/site-packages/goatools/anno/init/reader_gaf.py", line 88, in _read_gaf_nts
    self._add_data0(nts, lnum, line, get_all_nss, namespaces, datobj)
  File "/home/zrimec/miniconda3/envs/py36/lib/python3.6/site-packages/goatools/anno/init/reader_gaf.py", line 108, in _add_data0
    gafvals = datobj.get_gafvals(flds, nspc)
  File "/home/zrimec/miniconda3/envs/py36/lib/python3.6/site-packages/goatools/anno/init/reader_gaf.py", line 231, in get_gafvals
    flds[16] = self._get_set(flds[16].rstrip())
IndexError: list index out of range

**FATAL-gaf: list index out of range

**FATAL-gaf: gene_association.sgd.gaf[8]: SGD S000001503 SPT23 GO:0003674 GO_REF:0000015 ND F ER membrane protein involved in regulation of OLE1 transcription YKL020C protein taxon:559292 20181102 SGD

An exception has occurred, use %tb to see the full traceback.

SystemExit: 1

dvklopfenstein commented 3 years ago

Thank you so much for your interest in GOATOOLS and for taking the time to open this issue.

The file that you downloaded is a GAF 2.0 file, which is expected to have 17 fields. Instead it has 16: the last field, Gene Product Form ID, is missing.

I can change the code to read this incorrect format, however it might be better to do either:

Please let me know what might work for your situation.
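The field-count mismatch described above can be checked before handing a file to GafReader. The following is a minimal sketch using only the standard library; the helper name `count_gaf_fields` is hypothetical and not part of GOATOOLS. It assumes annotation lines are tab-separated and that header or comment lines start with `!`, as in the GAF format.

```python
from collections import Counter


def count_gaf_fields(lines):
    """Tally how many tab-separated fields each annotation line has.

    Hypothetical helper, not part of GOATOOLS. Lines starting with '!'
    (GAF header/comment lines) and blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        # Strip only the line ending so that trailing empty fields still count.
        counts[len(line.rstrip("\r\n").split("\t"))] += 1
    return counts


# Example usage (file name taken from the original post):
# with open("gene_association.sgd.gaf") as fin:
#     print(count_gaf_fields(fin))
```

A GAF 2.0 file in the expected layout should report a single key, 17. Any lines counted under 16 would be the ones that trigger the IndexError on `flds[16]`.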

dvklopfenstein commented 3 years ago

FYI: I also opened an issue with the Gene Ontology Consortium letting them know that we saw this and I created a test for it:

https://github.com/geneontology/helpdesk/issues/292

dvklopfenstein commented 3 years ago

@JanZrimec, we have heard back from a researcher at the Gene Ontology Consortium regarding the incorrectly formatted GAF file.

They advise using the official GO product at http://current.geneontology.org/annotations/sgd.gaf.gz, which has been processed by GO and not only has the expected number of fields, but also has additional yeast annotations including those from the PAINT pipeline.

Also: For best results, please use the files found at http://current.geneontology.org.
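For reference, fetching the recommended file could look like the sketch below. It uses only the standard library in place of `wget` and `gunzip`; the helper name `fetch_gaf` and the default output name `sgd.gaf` are assumptions made for illustration, while the URL is the one given above.

```python
import gzip
import shutil
import urllib.request

# URL recommended by the Gene Ontology Consortium in this thread.
GO_SGD_URL = "http://current.geneontology.org/annotations/sgd.gaf.gz"


def fetch_gaf(url=GO_SGD_URL, dest="sgd.gaf"):
    """Download a gzipped GAF file and write it uncompressed to `dest`.

    Hypothetical helper, not part of GOATOOLS.
    """
    with urllib.request.urlopen(url) as resp, \
            gzip.GzipFile(fileobj=resp) as gz, \
            open(dest, "wb") as out:
        # Stream-decompress so the whole file is never held in memory.
        shutil.copyfileobj(gz, out)
    return dest


# Example usage, reusing the GafReader call from the original post:
# from goatools.anno.gaf_reader import GafReader
# objanno_sc = GafReader(fetch_gaf())
```

Decompressing while streaming avoids a separate `gunzip` step and leaves no `.gz` file behind in the notebook directory.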

JanZrimec commented 3 years ago

Thanks! This resolved the issue and was really helpful!