ssayols / rrvgo

Reduce + Visualize Gene Ontology
GNU General Public License v3.0
21 stars 3 forks source link

No terms were found in orgdb #19

Closed tamsinlh closed 11 months ago

tamsinlh commented 11 months ago

Hello.

I am studying an organism that requires me to make an orgdb object. I followed the tutorial to do so with AnnotationForge using both my organism and the one it provides as an example, but when running calculateSimMatrix I recieve the following error:

No terms were found in orgdb for BP Check that the organism and ontology match the ones provided by orgdb (I am running rrvgo version 1.12.22.)

Any help would be appreciated. I would very much like to use this program.

Thank you in advance.

ssayols commented 11 months ago

Hi @tamsinlh , Looks like the ontology terms are not where rrvgo expects to find them in the orgdb object. Would you mind to share the orgdb object, and perhaps the R script you wrote to generate it? Something that I could use to reproduce your issue in my end.

cheers

tamsinlh commented 11 months ago

@ssayols Thank you. Below is the code you've requested. It doesn't appear to allow attaching the orgdb object. I would be happy to send it to you if you would be able to direct me.

Many thanks.


gsnapperFile <- system.file("extdata","GSnap.CorrectEggnog.txt",
                            package="AnnotationForge")
gsnapper <- read.table(gsnapperFile,sep="\t", head=TRUE)
head(gsnapper)
colnames(gsnapper)
#[1] "query"          "seed_ortholog"  "evalue"         "score"         
#[5] "eggNOG_OGs"     "max_annot_lvl"  "COG_category"   "Description"   
#[9] "Preferred_name" "GOs"            "EC"             "KEGG_ko"       
#[13] "KEGG_Pathway"   "KEGG_Module"    "KEGG_Reaction"  "KEGG_rclass"   
#[17] "BRITE"          "KEGG_TC"        "CAZy"           "BiGG_Reaction" 
#[21] "PFAMs"   
GSym <- gsnapper[,c(1,9,8)]
GSym <- GSym[GSym[,2]!="-",]
GSym <- GSym[GSym[,3]!="-",]
colnames(GSym) <- c("GID","SYMBOL","GENENAME")
head(GSym)
#GID SYMBOL
#4   g6.t1    TNC
#8  g12.t1   RTL1
#17 g36.t1  HERC1
#27 g63.t1  MYPOP
#GENENAME
#4                                                                                  Tenascin C
#8                                                          multicellular organism development
#17                 HECT and RLD domain containing E3 ubiquitin protein ligase family member 1
#27 proximal promoter DNA-binding transcription repressor activity, RNA polymerase II-specific

gsnapperGeneFile <- system.file("extdata","annoData.genes.txt",
                                package="AnnotationForge")
gsnapperGenes <- read.table(gsnapperGeneFile,sep="\t", head=TRUE)
head(gsnapperGenes)
gsChr <- gsnapperGenes[,c(1,2)]
gsChr <- gsChr[gsChr[,2]!="-",]
colnames(gsChr) <- c("GID","CHROMOSOME")
head(gsChr)
#GID CHROMOSOME
#1     g1.t1          1
#2    g10.t1          1
#3   g100.t1          1
#4  g1000.t1          1

gsnapperGOFile <- system.file("extdata","GSnap.GOsAsList4.tsv",
                              package="AnnotationForge")
gsGO <- read.table(gsnapperGOFile,sep="\t", header=TRUE)
gsGO <- gsGO[gsGO[,2]!="-",]
gsGO <- gsGO[gsGO[,3]!="",]
colnames(gsGO) <- c("GID","GO", "EVIDENCE")
head(gsGO)
#GID         GO EVIDENCE
#4 g6.t1 GO:0000003      IEA
#5 g6.t1 GO:0001101      IEA
#6 g6.t1 GO:0001655      IEA
#7 g6.t1 GO:0001763      IEA

makeOrgPackage(gene_info=GSym, chromosome=gsChr, go=gsGO,
               version="0.2",
               maintainer="Some One <so@someplace.org>",
               author="Some One <so@someplace.org>",
               outputDir = ".",
               tax_id="40503",
               genus="Lutjanus",
               species="griseusR",
               goTable="go")
#Warning messages:
#1: ggrepel: 16 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
#2: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
#3: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
Sys.setenv(R_INSTALL_STAGED = FALSE)
install.packages("./org.LgriseusR.eg.db", repos=NULL, type = "source")
library(org.LgriseusR.eg.db)
my_new_fancy_orgdb_object <- 'org.LgriseusR.eg.db'
my_new_fancy_orgdb_object

go_analysis <- read.delim(system.file("extdata/CHSA2019_SigDiffMeth2.txt", package="rrvgo"))
head(go_analysis)
#    ID Avg_Meth_Diff  geneID                          Description
#1          -     -21.32544 FAM124A Family with sequence similarity 124A
#2 GO:0001766     -17.09158    PLLP                          Plasmolipin
#3 GO:0003674     -17.09158    PLLP                          Plasmolipin
#4 GO:0005198     -17.09158    PLLP                          Plasmolipin
#5 GO:0005575     -17.09158    PLLP                          Plasmolipin
#6 GO:0005623     -17.09158    PLLP                          Plasmolipin
simMatrix2 <- calculateSimMatrix(go_analysis,
                                 orgdb="org.LgriseusR.eg.db",
                                 semdata = GOSemSim::godata(my_new_fancy_orgdb_object, ont="BP",keytype = 'GID'),
                                 method="Rel",
                                 keytype = "GID")
#preparing gene to GO mapping data...
#preparing IC data...
#Warning message:
#  In calculateSimMatrix(go_analysis, orgdb = "org.LgriseusR.eg.db",  :
#                          No terms were found in orgdb for BP
#                        Check that the organism and ontology match the ones provided by orgdb
ssayols commented 11 months ago

The input to calculateSimMatrix() should be a vector of GO terms. You're currently calling it with a data.frame that contains multiple columns. Instead, use the column with the GO ID only:

simMatrix2 <- calculateSimMatrix(go_analysis$ID,
                                 orgdb="org.LgriseusR.eg.db",
                                 semdata = GOSemSim::godata(my_new_fancy_orgdb_object, ont="BP",keytype = 'GID'),
                                 method="Rel",
                                 keytype = "GID")

My guess is that this should work. Please let me know otherwise.

tamsinlh commented 11 months ago

@ssayols Yes, thank you, that was an easy fix. Much appreciated.

Additionally, would you have any guidance on getting the treemapPlot to label all, or the majority, of boxes and adjusting max.overlaps for the scatterplot?

ssayols commented 11 months ago

Usually it's a matter of the canvas size, whether there's enough room for more labels. Theoretically, the plotting functions maximize the number of labels displayed based on the space available. You could try increasing the size of the canvas with pdf(file="x.pdf", height=10, width=10) (instead of the default 7x7 inches).

tamsinlh commented 11 months ago

Thank you. That doesn't appear to work for me - it makes the pdf I created unusable. When R plots the treemap, it is very tiny in the window and the .pdf reflects the same size and labeling. I am wondering if there are too many GO terms grouped together - I am needing to set the similarity to 0.9 to be able to see the labels I do have. The example for the package has far fewer boxes than I am seeing so I suspect that is part of the issue for me.

ssayols commented 11 months ago

that's right, treemaps maybe difficult to navigate with too many terms. Alternatively you could try other visualizations, like sunburst plots (currently the package doesn't support them), or simply a scatterplot (volcano visualizing -log(p) vs. FC) of the parent terms only.