lgeistlinger / EnrichmentBrowser

Seamless navigation through combined results of set-based and network-based enrichment analysis
20 stars 11 forks source link

eaBrowse error with S. cerevisiae org.db #18

Closed matrs closed 5 years ago

matrs commented 5 years ago

Hello, First, thanks for your package, it's been really useful for me. I've been working with RNA-seq from Saccharomyces cerevisiae. I had a pipeline based on edgeR so I take the results from it and made a SummarizedExperiment object as specified in the EnrichmentBrowser manual.

> se_test
class: SummarizedExperiment 
dim: 5584 6 
metadata(2): annotation dataType
assays(1): ''
rownames(5584): 851262 851261 ... 856333 856334
rowData names(2): FC ADJ.PVAL
colnames(6): SRR3396381_rep1 SRR3396382_rep2 ... SRR3396386_rep2 SRR3396387_rep3
colData names(1): GROUP

When I run

sbea_test <- sbea(method="ora", se=se_test, gs=go.gs, perm=0, alpha=0.05, 
                           padj.method = "BH")

Everything appears to be OK. When I look at the results, I got:

gsRanking(sbea_test)
DataFrame with 28 rows and 5 columns
                                                                                                                                       GENE.SET
                                                                                                                                    <character>
1                                                                                                                GO:0042254_ribosome_biogenesis
2                                                                                                                    GO:0006364_rRNA_processing
3                                           GO:0000463_maturation_of_LSU-rRNA_from_tricistronic_rRNA_transcript_(SSU-rRNA,_5.8S_rRNA,_LSU-rRNA)
4                                GO:0000472_endonucleolytic_cleavage_to_generate_mature_5'-end_of_SSU-rRNA_from_(SSU-rRNA,_5.8S_rRNA,_LSU-rRNA)
5                                                                                           GO:0008652_cellular_amino_acid_biosynthetic_process
...                                                                                                                                         ...
     NR.GENES NR.SIG.GENES      PVAL           ADJ.PVAL
    <numeric>    <numeric> <numeric>          <numeric>
1         198          156  2.73e-32        2.68359e-29
2         205          155  1.96e-28         9.6334e-26
3          32           30  5.86e-11       1.440095e-08
4          32           30  5.86e-11       1.440095e-08
5          96           67  3.17e-10        6.23222e-08
...       ...          ...       ...                ...

But when I try eaBrowse, it fails:

> eaBrowse(sbea_test, out.dir = "enrichmentbrowser_se-test", html.only=TRUE)
Creating gene report ...
Error in .testForValidCols(x, cols) : 
  Invalid columns: SYMBOL. Please use the columns method to see a listing of valid arguments.

So I suppose this error is related to the fact that the "org" S. cerevisiae db, org.Sc.sgd.db, doesn't have a column for SYMBOL

> columns(org.Sc.sgd.db)
 [1] "ALIAS"        "COMMON"       "DESCRIPTION"  "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"      
 [9] "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "INTERPRO"     "ONTOLOGY"     "ONTOLOGYALL" 
[17] "ORF"          "PATH"         "PFAM"         "PMID"         "REFSEQ"       "SGD"          "SMART"        "UNIPROT"     

Is this needed to build the links between "gene names/symbols" and the pages at NCBI showed in the volcano plots? If that's the case, could it be possible to use the already available "entrez ids" to do this?

lgeistlinger commented 5 years ago

Thanks a lot for reporting this. Apparently, org.Sc.sgd.db is the only org.db package that doesn't contain a SYMBOL column.

For org.Sc.sgd.db the gene symbol seems to be rather stored in the GENENAME column

> AnnotationDbi::mapIds(org.Sc.sgd.db, keytype="ENTREZID", keys=c("853972","851092","855323"), column="GENENAME")
'select()' returned 1:1 mapping between keys and columns
853972 851092 855323 
"PCK1" "FBP1" "CAT8"

and the gene name in the DESCRIPTION column

> AnnotationDbi::mapIds(org.Sc.sgd.db, keytype="ENTREZID", keys=c("853972","851092","855323"), column="DESCRIPTION")
'select()' returned 1:1 mapping between keys and columns
                                                                                                                                                                                                                                                                                                           853972 
                                                                "Phosphoenolpyruvate carboxykinase; key enzyme in gluconeogenesis, catalyzes early reaction in carbohydrate biosynthesis, glucose represses transcription and accelerates mRNA degradation, regulated by Mcm1p and Cat8p, located in the cytosol" 
                                                                                                                                                                                                                                                                                                           851092 
"Fructose-1,6-bisphosphatase; key regulatory enzyme in the gluconeogenesis pathway, required for glucose metabolism; undergoes either proteasome-mediated or autophagy-mediated degradation depending on growth conditions; glucose starvation results in redistribution to the periplasm; interacts with Vid30p" 
                                                                                                                                                                                                                                                                                                           855323 
                             "Zinc cluster transcriptional activator; necessary for derepression of a variety of genes under non-fermentative growth conditions, active after diauxic shift, binds carbon source responsive elements; relative distribution to the nucleus increases upon DNA replication stress"

I will ask on the Bioc-devel mailing list where this inconsistency comes from.

As a work-around, you can do the following:

configEBrowser("SYM.COL", "GENENAME")
configEBrowser("GN.COL", "DESCRIPTION")

and try running

eaBrowse(sbea_test)

again.

You can restore the default configuration settings via

configEBrowser()

Note that I just introduced access to these columns, so you will need the latest release version of EnrichmentBrowser (v2.14.2) which will become available under http://bioconductor.org/packages/EnrichmentBrowser within 24 hours.

If you want to try it out right away you can install the devel version (v2.15.6) directly from github via:

BiocManager::install("lgeistlinger/EnrichmentBrowser")
matrs commented 5 years ago

Hey, thank you for your quick answer, I'll try it and I'll let you known

About the missing SYMBOL column, maybe It's because there are too many missing mappings between any organizations' IDs and HUGO names?

> entrez2genename <- (AnnotationDbi::mapIds(org.Sc.sgd.db, keytype="ENTREZID", keys=rownames(se_test), column="GENENAME"))
'select()' returned 1:1 mapping between keys and columns

> sum(is.na(entrez2genename))
[1] 603
lgeistlinger commented 5 years ago

I think the issue comes from the fact that org.Sc.sgd.db is created based on data from the Saccharomyces Genome Database (SGD), as opposed to other org.db packages such as org.Hs.eg.db that are created based on data from the NCBI.

And for some reason (likely historical), the SGD uses gene name for what is typically referred to by gene symbol (see eg the SGD entry for PCK1).

lgeistlinger commented 5 years ago

I am closing this. Feel free to reopen if you find the suggested work-around to not work for your purposes.