tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

[Question] How to parse .json, .obo, or .owl to get dictionary of enzymes {id_go:{ec_1, ec_2, ..., ec_n}} #292

Closed jolespin closed 7 months ago

jolespin commented 7 months ago

I'm trying to understand how I can use GOATOOLS to parse any of the GO files to yield a dictionary that has the following structure:

{id_go: {ec_1, ec_2, ..., ec_n}}

I was able to load the obo file but I couldn't figure out how to get the enzymes:


from goatools.base import get_godag

godag = get_godag('Databases/GO/go-basic.obo', optional_attrs='relationship')
go = godag['GO:0000015']

for id_go, go in godag.items():

    print(id_go, go.get_all_children())
#GO:0000001 set()
#GO:0000002 set()
#GO:0000006 set()
#GO:0000007 set()
#GO:0000009 {'GO:0033164', 'GO:0052917'}

They are definitely in there, I just don't know how to access them:

%%bash
grep -c "EC:" /Users/jolespin/Databases/GO/go-basic.obo

# Databases/GO/go-basic.obo:26098

tanghaibao commented 7 months ago

@jolespin

You are close. EC numbers are stored under xref, not relationship (you can check which field they are under in the .obo file).

Here is some sample code:

from goatools.base import get_godag

godag = get_godag("go-basic.obo", optional_attrs="xref")

for id_go, go in godag.items():
    ecs = [x for x in go.xref if x.startswith("EC:")]
    if ecs:
        print(id_go, ecs)

This prints out:

...
GO:0008557 ['EC:7.6.2.1']
GO:1901237 ['EC:7.3.2.6']
GO:0090450 ['EC:3.6.1.64']
GO:0043851 ['EC:2.1.1.246']
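
To get the {id_go: {ec_1, ec_2, ..., ec_n}} structure asked for in the question, the same xref filter can be collected into a dictionary of sets instead of being printed. The following is a minimal sketch, not part of goatools: the helper names `build_go2ecs` and `load_go2ecs` are hypothetical, `FakeTerm` is a stand-in used only so the example runs without the library or an .obo file, and the `RHEA:12345` value is made up to show a non-EC xref being filtered out.

```python
from collections import namedtuple


def build_go2ecs(godag, prefix="EC:"):
    """Return {GO ID: set of EC xrefs} for terms with at least one EC xref."""
    go2ecs = {}
    for id_go, term in godag.items():
        # `xref` is only present when the DAG was loaded with
        # optional_attrs="xref"; terms without it are skipped.
        ecs = {x for x in getattr(term, "xref", ()) if x.startswith(prefix)}
        if ecs:
            go2ecs[id_go] = ecs
    return go2ecs


def load_go2ecs(obo_path):
    """Load an OBO file with goatools and build the mapping (hypothetical helper)."""
    # Imported lazily so build_go2ecs can be used without goatools installed.
    from goatools.base import get_godag

    godag = get_godag(obo_path, optional_attrs="xref")
    return build_go2ecs(godag)


# Offline demonstration with stand-in term objects (not real GOTerm instances).
FakeTerm = namedtuple("FakeTerm", "xref")
demo_dag = {
    "GO:0008557": FakeTerm({"EC:7.6.2.1", "RHEA:12345"}),  # RHEA value is made up
    "GO:0000001": FakeTerm(set()),                         # no xrefs, so omitted
}
demo = build_go2ecs(demo_dag)
print(demo)
```

With a real file, `load_go2ecs("go-basic.obo")` should return the full mapping. As far as I can tell, the GODag also keys alternate GO IDs to the same term object, so some EC sets may appear under more than one key; that is an assumption worth checking against your copy of the ontology.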