Closed iamjli closed 7 years ago
Thank you for your interest and time in GOATOOLS.
GOATOOLS has functions to read annotations from both https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz and http://www.geneontology.org/page/download-annotations.
You may want to read the annotations from a GAF file from the geneontology website rather than the NCBI website.
Reading a GAF file looks like this:
from goatools.associations import read_gaf
geneid2gos = read_gaf("goa_human.gaf")
Please let us know if this information meets your needs. If not, we can further discuss and come up with something that will hopefully be helpful to you. Thank you again for using GOATOOLS.
Thanks for getting back to me! I am now receiving this error using 0.6.10
It looks the file, 'goa_human.gaf' was not found.
Perhaps try using the full path name. Code like this should work when the filename matches a location. If you can get code like shown below, then your reading of the association should work. Hopefully, this is helpful. Please let us know...:
import os
fin = "/cygdrive/c/Users/dvklo/git/goatools/goa_human.gaf"
if os.path.isfile(fin):
print("SUCCESS: {FIN} was found".format(FIN=fin))
else:
raise Exception("FAILURE: {FIN} NOT FOUND".format(FIN=fin))
Ok, so I downloaded goa_human.gaf
from the GO link you provided, and it read the file fine. The new file didn't seem to change the density plot much though:
You have me very curious. We are not the providers of the associations, but just read them in.
However, I'd like to check into this further just out of curiosity. Can you provide your plotting code and the web links for your big source files? I'll check it out...
I see. Here's my code.
Here are the files used: goa_human.gz and GO_Biological_Process_2015.txt.
I previously used only pathways that were present in the Enrichr database to generate these figures, but that should be fixed in the notebook. Thanks for your help!
Just wanted to get back to you. I will work on this after I address the other active issue. I am curious as to what you are seeing and why.
@dvklopfenstein I emailed the Enrichr folks about this and they got back with this response:
I think the idea was to cut the GO tree at the third level so we do not get very general terms. Also, terms with less than five genes were removed. The lack over overlap with the current version could be due to GO standardizing and consolidating terms. It is likely that we processed the data from the same source but an older version.
Unfortunately the script used to process this data is missing, but he said he would let me know when they write another one...
Thanks for the update.
Cutting the GO tree at the 3rd level may not be desirable because it removes BOTH general terms AND over 400 (in biological_process alone for human associations) specific leaf-level terms like:
GO:0021548 L03 pons development GO:0051026 L03 chiasma assembly GO:0097278 L02 complement-dependent cytotoxicity ...
Also, removing terms with less than 5 genes may not be desirable because:
Pruning the GO list by removing GO terms higher than the 3rd level and GO terms having less than 5 genes will lead to a loss of highly specific interesting GO terms.
So these items may be something to consider when thinking about pruning GO DAGs.
I am going to close this issue. If you feel this is not correct, please open a new issue and we can continue the conversation. Your question was very interesting. Thank you for your interest in GOATOOLS and your time.
Hi there,
Thanks for this code, it's exactly what I needed. I'm switching over from Enrichr (doesn't allow for background gene sets...), and they use GO BP 2015 annotations. I notice that the annotations in the file that they use is quite different than the annotations taken from NCBI, as was done in the tutorial.
Specifically, I plotted the number of terms for each GO pathway in the Enrichr annotations file vs. the NCBI file recommended in the tutorial (log2 scale), and found that there were many terms missing from the latter:
It appears that the NCBI annotations file is up to date. Could you please explain the discrepancy here?
Thanks!