tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
773 stars 211 forks source link

An example of GOEA with Uniprots #251

Closed Maryam-Haghani closed 3 months ago

Maryam-Haghani commented 1 year ago

Hi, I have noticed in the paper that we can use GPAD format from European Bioinformatics Institute’s FTP site for Gene Ontology Enrichment Analysis instead of NCBI's GAF format:

"The annotations are currently available for download from the GOC as GO Annotation Format (GAF), from NCBI’s FTP server in a gene2go format, or from the European Bioinformatics Institute’s FTP site in the Gene Product Association Data format (GPAD). GOATOOLS can efficiently parse these relevant file formats to retrieve rich attributes of each term..."

Based on this, as I have my data in Uniprot format, I am trying to work with the GPAD format to calculate over-represented GOs based on UniprotIDs. But, unfortunately, could not find any examples of that, the way it exists for NCBI gene data. Having a look at the python files in the repository, I came up to the code below which gives an AssertionError while reading the associations at GpadReader(fin_gpad). Could you please help me to resolve this issue?

Here is my code:

from goatools.base import download_go_basic_obo
from goatools.anno.dnld_ebi_goa import DnldGoa
from goatools.base import get_godag
from goatools.anno.gpad_reader import GpadReader

obo_fname = download_go_basic_obo()

objdnld = DnldGoa()
fin_gpad = objdnld.dnld_goa('human', 'gpa', None)

godag = get_godag()
objgpad = GpadReader(fin_gpad)

P.S.: For the last line, I also tried objgpad = GpadReader(fin_gpad, godag=godag), but nothing changed.

dvklopfenstein commented 1 year ago

Thank you for your interest in GOA Tools and for giving your time to write this issue.

Download the annotations before reading using wget:

$ wget http://geneontology.org/ontology/go-basic.obo
$ wget http://current.geneontology.org/annotations/goa_human.gpad.gz
$ gunzip goa_human.gpad.gz

Then read the annotations like this:

from goatools.base import get_godag
from goatools.anno.gpad_reader import GpadReader

godag = get_godag('go-basic.obo')
anno = GpadReader('goa_human.gpad', godag=godag)

I added a new notebook demonstrating how to read annotations from a GPAD file here

Maryam-Haghani commented 1 year ago

Thank you for your helpful answer @dvklopfenstein. As far as I see, you have downloaded the gpad file in your notebook from GO website that is gpa-version: 1.2 and using this version resolves my issue. But problem with my first code using goatools.anno.dnld_ebi_goa.py file is that it downloads the gpa file from _ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goahuman.gpa.gz, with gpa-version: 1.1. As error details says, GpadReader cannot find some evidence code of this file in eco2group dictionary, and hence it raises an error. I think to get rid of the error, we have to change goatools.anno.dnld_ebi_goa.py file to use version 1.2 of the gpad file.

dvklopfenstein commented 1 year ago

Thank you very much for the error detail and the name and location of the annotation file. This is extremely helpful.

Likely, we need to update the ECO codes. This should go quickly. I'll have an update ASAP.

Thank you for your time to report the error on the ECO code and thank you for your interest in GOA Tools.

Maryam-Haghani commented 1 year ago

You are welcome! BTW, I used evidenceontology to have the updated ECO codes and have written a code to convert this file to a dictionary automatically. This code can be used in eco2group.py to have the ECO2GRP dictionary automatically rather than hard-coding.

Here is the code:

import pandas as pd

#Use eco file existing in evidenceontology
URL = "https://raw.githubusercontent.com/evidenceontology/evidenceontology/master/gaf-eco-mapping-derived.txt"

# Read file in df format
df_eco = pd.read_csv(URL, comment='#', sep="\t", header = None)

ECO2GRP = dict(df_eco[[0,1]].values)
dvklopfenstein commented 1 year ago

Very nice.

Thank you so much for the additional information regarding downloading the ECO codes to be used by GOA Tools.

Your idea works well for GOA Tools as our philosophy is to decouple from storing other group's databases, which benefits researchers by giving them the power to use the latest data or to cite a specific data version, which is needed when publishing.

Thank you so much. I will be sure to credit you as a contributor on the next pull request which implements the change you suggest.

Thank you for your time and for your interest in GOA Tools.

Maryam-Haghani commented 1 year ago

And thanks for your valuable work.