jokergoo / rGREAT_genesets

Genesets and gene coordinates for a lot of organisms
https://jokergoo.github.io/rGREAT_genesets/
0 stars 0 forks source link

Missing GO gene sets? #1

Open gexijin opened 2 years ago

gexijin commented 2 years ago

Hi Zuguang, Thank you for making these genesets available. We have done similar things and used it to power our tools such as ShinyGO and iDEP. To make sure we are doing things correctly, I compared our GO genesets with yours. I focused on the first species on your list: Pink-footed goose.
Combining BP, CC, and MF, your file contains 14,540 terms. But our file has 18,928 terms. There are 14,491 terms in common. I found that there are 4437 GO Terms not included in your file and 49 not in ours. For example, GO:0000323 is missing from your file.

Both are derived from Ensembl BiomaRt and the only difference is how the sets are compiled. We used the GeneSetCollection function from the GSEABase package.

Initially, I am afraid of missing terms in our data as I did not remember to collect genes for all child terms. I am not sure if this worth looking into from your side.

Here is our code:

library(biomaRt)
library("GO.db")
library("GSEABase")
library(RCurl)
myMart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "cabingdonii_gene_ensembl")
try(goframeData <- getBM(
  attributes = c("go_id", "go_linkage_type", "ensembl_gene_id"),
  filters = "biotype", values = "protein_coding",
  mart = myMart
))
# Remove if evidence is missing
goframeData <- goframeData[which(goframeData[, 2] != ""), ]
# Remove if evidence is missing
goframeData <- goframeData[which(goframeData[, 3] != "NA"), ]

goFrame <- GOFrame(goframeData)
rm(goframeData)
goAllFrame <- GOAllFrame(goFrame)
rm(goFrame)
# Build gene set collection
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
gene_sets <- geneIds(gsc)
jokergoo commented 2 years ago

Hi, thanks for the information! Actually I know your resources :)

With your code, GO:0000323 is not there neither.

> myMart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "abrachyrhynchus_gene_ensembl")
> try(goframeData <- getBM(
+   attributes = c("go_id", "go_linkage_type", "ensembl_gene_id"),
+   filters = "biotype", values = "protein_coding",
+   mart = myMart
+ ))
> goframeData[goframeData$go_id == "GO:0000323", ]
[1] go_id           go_linkage_type ensembl_gene_id
<0 rows> (or 0-length row.names)

I guess it is because the version of GO on Ensembl/BioMart database is different when I and you retrieved data from there.

If you go to the GO webpage of GO:0000323 (http://amigo.geneontology.org/amigo/term/GO:0000323) and select the organism from the left side bar, actually there is no annotation for this organism pink-footed goose (Anser brachyrhynchus):

image
gexijin commented 2 years ago

Hi Zuguang, If you run these lines to build the genesets that category "GO:0000323" is there. The hierarchical GO structure is different.

goFrame <- GOFrame(goframeData)
rm(goframeData)
goAllFrame <- GOAllFrame(goFrame)
rm(goFrame)
# Build gene set collection
gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
jokergoo commented 2 years ago

So this means GeneSetCollection() also uses GO terms that are not in goframeData?

gexijin commented 2 years ago

Probably. I think they did something like what you did, recursively get all genes associated with child-terms. I am not an expert on GO, we just have been using this for a while. I am not 100% the GSEABase package is correct though.

jokergoo commented 2 years ago

I will have a look. I haven't used GSEABase neither the functions GOFrame() nor GOAllFrame().

jokergoo commented 2 years ago

I think when using GOFrame() or GOAllFrame(), the organism should also be set via the organism argument. I am not clear how the organism information is used in to the complex source code (https://code.bioconductor.org/browse/AnnotationDbi/blob/master/R/GOTerms.R), but it definitely affects the final results.

jokergoo commented 2 years ago

I think I found the reason. If a parent GO term A is identical to its child node B (with the same set of genes, the union of all its child nodes), then in goframeData, A is not there and therefore it is not included in my list. I will fix this bug. Thanks!

gexijin commented 2 years ago

That is not necessarily a bug, as you could argue that A and B are redundant and B is a more specific Term to have.

jokergoo commented 2 years ago

Now the numbers are quite similar:

   GO.db GOFrame
BP 13621   13624
CC  1686    1685
MF  3694    3695

The number of annotated genes:

image

Generally, now the two sources are very similar, except a very few terms are annotated to different sets of genes, which I guess due to the different GO sources.

E.g. the gene ENSABRG00000005043 is in GO:0000395 in GO.db (also checked on Ensembl website https://www.ensembl.org/Anser_brachyrhynchus/Gene/Ontologies/biological_process?db=core;g=ENSABRG00000005043;r=NXHY01000315.1:63864-64021;t=ENSABRT00000007824), but it is not in GO:0000395 by GOFrame.