tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
745 stars 212 forks source link

Discrepancies between using GoSearch and direct search on geneontology.org #243

Open jasperhyp opened 1 year ago

jasperhyp commented 1 year ago

Hi! Thanks for creating this wonderful library. I followed the instructions at https://github.com/tanghaibao/goatools/blob/main/notebooks/cell_cycle.ipynb to get GO-associated genes, but soon found out that there are some discrpancies between the results (in terms of # of ontology terms I get and also # of genes) derived here and from the website. I used go.obo (downloaded from here) instead of go-basic.obo to avoid data discrepancies.

I first noticed the number of GO terms hit are different. There are too many "cell cycle" hits so as an example, let's change "cell cycle" to "lipid metabolic". There are 9 hits on the website but 14 via the library. I manually examined the discrepancies and it looks like the library is capturing a lot of "\<something>lipid metabolic" which the website does not, but cannot explain why the library hits "regulation of lipid metabolic process" (GO:0019216) and "membrane lipid metabolic process" (GO:0006643) but not "regulation of membrane lipid metabolic process" (GO:1905038).

Then, I changed "lipid metabolic" to " lipid metabolic" (adding a blank in the front) to avoid hitting the "\<sth>lipid" stuff. Unexpectedly, there are only 6 hit terms left for the library (still 9 for the website). Weirdly, the library hit ether lipid metabolic process (GO:0046485) but not neutral lipid metabolic process (GO:0006638). There are more examples and I just couldn't understand why.

Yet again, to maximize replicability, I directly fed GO:0007049 (cell cycle) into srchhelp.add_children_gos() and I also went into that term through the website and ensured additional filtering (screenshot attached below). However, there are several hundred (unique) genes derived from the website but only 41 are derived from the library.

The code for feeding cell cycle directly is basically,

with open(fout_allgos, "w") as log:
    # Search for 'cell cycle' in GO terms
    # gos_cc_all = srchhelp.get_matching_gos(cell_cycle_all, prt=log)
    # Find any GOs matching 'cell cycle-independent' (e.g., "lysosome")
    # gos_no_cc = srchhelp.get_matching_gos(cell_cycle_not, gos=gos_cc_all, prt=log)
    # Remove GO terms that are not "cell cycle" GOs
    # gos = gos_cc_all.difference(gos_no_cc)
    # Add children GOs of cell cycle GOs
    # gos_all = srchhelp.add_children_gos(gos)
    gos_all = srchhelp.add_children_gos({'GO:0007049'})
    # Get Entrez GeneIDs for cell cycle GOs
    geneids = srchhelp.get_items(gos_all)

I would appreciate any help! I am new to GO so I might be doing many things wrongly. Please feel free to correct anything.

image

jasperhyp commented 1 year ago

I truly don't understand... The biological processes in go.obo were just not hit by exactly the same queries in many cases...