tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License
783 stars 210 forks source link

Difference between gene2go.gz and GO BP 2015 annotations #79

Closed iamjli closed 7 years ago

iamjli commented 7 years ago

Hi there,

Thanks for this code, it's exactly what I needed. I'm switching over from Enrichr (doesn't allow for background gene sets...), and they use GO BP 2015 annotations. I notice that the annotations in the file that they use is quite different than the annotations taken from NCBI, as was done in the tutorial.

Specifically, I plotted the number of terms for each GO pathway in the Enrichr annotations file vs. the NCBI file recommended in the tutorial (log2 scale), and found that there were many terms missing from the latter:

image

It appears that the NCBI annotations file is up to date. Could you please explain the discrepancy here?

Thanks!

dvklopfenstein commented 7 years ago

Thank you for your interest and time in GOATOOLS.

GOATOOLS has functions to read annotations from both https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz and http://www.geneontology.org/page/download-annotations.

You may want to read the annotations from a GAF file from the geneontology website rather than the NCBI website.

Reading a GAF file looks like this:

from goatools.associations import read_gaf
geneid2gos = read_gaf("goa_human.gaf")

Please let us know if this information meets your needs. If not, we can further discuss and come up with something that will hopefully be helpful to you. Thank you again for using GOATOOLS.

iamjli commented 7 years ago

Thanks for getting back to me! I am now receiving this error using 0.6.10

image

dvklopfenstein commented 7 years ago

It looks the file, 'goa_human.gaf' was not found.

Perhaps try using the full path name. Code like this should work when the filename matches a location. If you can get code like shown below, then your reading of the association should work. Hopefully, this is helpful. Please let us know...:

import os

fin = "/cygdrive/c/Users/dvklo/git/goatools/goa_human.gaf"
if os.path.isfile(fin):
    print("SUCCESS: {FIN} was found".format(FIN=fin))
else:
    raise Exception("FAILURE: {FIN} NOT FOUND".format(FIN=fin))
iamjli commented 7 years ago

Ok, so I downloaded goa_human.gaf from the GO link you provided, and it read the file fine. The new file didn't seem to change the density plot much though:

image

dvklopfenstein commented 7 years ago

You have me very curious. We are not the providers of the associations, but just read them in.

However, I'd like to check into this further just out of curiosity. Can you provide your plotting code and the web links for your big source files? I'll check it out...

iamjli commented 7 years ago

I see. Here's my code.

Here are the files used: goa_human.gz and GO_Biological_Process_2015.txt.

I previously used only pathways that were present in the Enrichr database to generate these figures, but that should be fixed in the notebook. Thanks for your help!

dvklopfenstein commented 7 years ago

Just wanted to get back to you. I will work on this after I address the other active issue. I am curious as to what you are seeing and why.

iamjli commented 7 years ago

@dvklopfenstein I emailed the Enrichr folks about this and they got back with this response:

I think the idea was to cut the GO tree at the third level so we do not get very general terms. Also, terms with less than five genes were removed. The lack over overlap with the current version could be due to GO standardizing and consolidating terms. It is likely that we processed the data from the same source but an older version.

Unfortunately the script used to process this data is missing, but he said he would let me know when they write another one...

dvklopfenstein commented 7 years ago

Thanks for the update.

Cutting the GO tree at the 3rd level may not be desirable because it removes BOTH general terms AND over 400 (in biological_process alone for human associations) specific leaf-level terms like:

GO:0021548 L03 pons development GO:0051026 L03 chiasma assembly GO:0097278 L02 complement-dependent cytotoxicity ...

Also, removing terms with less than 5 genes may not be desirable because:

  1. Leaf-level (or close to leaf-level) GO terms are those that generally have fewer gene associations.
  2. GO terms which are broader usually have more genes associated with them. So by pruning GO terms having less than 5 genes, you are removing thousands of specific GO terms. This may not be desirable.

Pruning the GO list by removing GO terms higher than the 3rd level and GO terms having less than 5 genes will lead to a loss of highly specific interesting GO terms.

So these items may be something to consider when thinking about pruning GO DAGs.

dvklopfenstein commented 7 years ago

I am going to close this issue. If you feel this is not correct, please open a new issue and we can continue the conversation. Your question was very interesting. Thank you for your interest in GOATOOLS and your time.