lgeistlinger / EnrichmentBrowser

Seamless navigation through combined results of set-based and network-based enrichment analysis

caching recalls wrong object type #32

Closed GabrielHoffman closed 4 years ago

GabrielHoffman commented 4 years ago

Hi, Last bug report for the night, plus some ideas.

It seems that the caching stores the object returned by getGenesets(), so if I call it first with return.type="list", any subsequent call will return a list even if return.type='GeneSetCollection' is used. I can manually set cache=FALSE to get the proper object. But I'd like to call getGenesets() in a package I distribute to others, and I can't take advantage of the great caching ability because it might return an object of the wrong type. See example at the bottom.

While we're talking about caching...have you thought about caching the genesets after running idMap to convert the gene ids? I've found that idMap is very slow but I need to run it in every R session to convert to ENSEMBL ids:

# Convert gene identifiers from the default Entrez to Ensembl
gs.go.ensid = idMap(gs.go, org = "hsa", from = "ENTREZID", to = "ENSEMBL") 

So it would be great if idMap could be called within getGenesets so the object with ENSEMBL names could be cached.
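Until something like this exists in the package, one user-side option is to store the mapped gene sets on disk and reload them in later sessions. The following is only a minimal sketch: cached_map, toy_map, and the cache file location are hypothetical names made up for illustration, and toy_map is a stand-in for the real idMap call so that the example runs without Bioconductor.

```r
# Hypothetical helper: cache the result of a slow mapping step on disk.
# `map_fun` stands in for a call such as idMap(); nothing here is part of
# EnrichmentBrowser itself.
cached_map <- function(gs, map_fun, cache_file) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))   # reuse the result of an earlier session
  }
  mapped <- map_fun(gs)           # slow step, run only on a cache miss
  saveRDS(mapped, cache_file)
  mapped
}

# Toy stand-in for the real mapping (assumption: IDs are simply prefixed)
calls <- 0
toy_map <- function(gs) {
  calls <<- calls + 1
  lapply(gs, function(ids) paste0("ENSG_", ids))
}

gs <- list(setA = c("10000", "1890"), setB = c("2796"))
cache_file <- tempfile(fileext = ".rds")

first  <- cached_map(gs, toy_map, cache_file)  # computes and stores
second <- cached_map(gs, toy_map, cache_file)  # read back from disk
```

In real use, map_fun could wrap the idMap call shown above, and cache_file would need to be a persistent path rather than a tempfile. The cached file would also have to be deleted whenever the underlying gene sets or annotation packages are updated.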

I've been thinking about statistical methods for geneset analyses, and I've found your work really useful. I'm currently working on making a new method more accessible to users, and I'd be happy to share once I get a little further.

Cheers, Gabriel

library(EnrichmentBrowser)

# Load GO for the first time
# Implicitly, return.type is a "list"
res1 = getGenesets(org="hsa", db="go")
#> 
#> Loading required package: org.Hs.eg.db
#> Loading required package: AnnotationDbi
#> 
head(res1)
#> $`GO:0000002_mitochondrial_genome_maintenance`
#>  [1] "10000" "1890"  "291"   "4205"  "4358"  "4976"  "55154" "55186" "80119"
#> [10] "84275" "92667" "9361" 
#> 
#> $`GO:0000003_reproduction`
#> [1] "2796"   "2797"   "286826" "8510"  
#> 
#> $`GO:0000012_single_strand_break_repair`
#> [1] "100133315" "200558"    "23411"     "3981"      "54840"     "55775"    
#> [7] "7141"      "7515"     
#> 
#> $`GO:0000018_regulation_of_DNA_recombination`
#> [1] "10189" "3575"  "3836"  "3838"  "56916" "9984" 
#> 
#> $`GO:0000019_regulation_of_mitotic_recombination`
#> [1] "10111" "2068"  "4361" 
#> 
#> $`GO:0000022_mitotic_spindle_elongation`
#> [1] "9055" "9493"

# Now I want a "GeneSetCollection"
# but the list is already cached
res2 = getGenesets(org="hsa", db="go", return.type='GeneSetCollection')
head(res2)
#> $`GO:0000002_mitochondrial_genome_maintenance`
#>  [1] "10000" "1890"  "291"   "4205"  "4358"  "4976"  "55154" "55186" "80119"
#> [10] "84275" "92667" "9361" 
#> 
#> $`GO:0000003_reproduction`
#> [1] "2796"   "2797"   "286826" "8510"  
#> 
#> $`GO:0000012_single_strand_break_repair`
#> [1] "100133315" "200558"    "23411"     "3981"      "54840"     "55775"    
#> [7] "7141"      "7515"     
#> 
#> $`GO:0000018_regulation_of_DNA_recombination`
#> [1] "10189" "3575"  "3836"  "3838"  "56916" "9984" 
#> 
#> $`GO:0000019_regulation_of_mitotic_recombination`
#> [1] "10111" "2068"  "4361" 
#> 
#> $`GO:0000022_mitotic_spindle_elongation`
#> [1] "9055" "9493"

# I can disable the cache to correctly get 'GeneSetCollection'
res3 = getGenesets(org="hsa", db="go", return.type='GeneSetCollection', cache=FALSE)
res3
#> GeneSetCollection
#>   names: GO0000002, GO0000003, ..., GO2001311 (12233 total)
#>   unique identifiers: 10000, 1890, ..., 55365 (18670 total)
#>   types in collection:
#>     geneIdType: EntrezIdentifier (1 total)
#>     collectionType: GOCollection (1 total)

lgeistlinger commented 4 years ago

Both are possible.

Caching of only one object type representing the gene sets of choice was based on my assumption that users would typically be interested in either working with gene set lists or with GeneSetCollections. But apparently you have a use case where you want to be able to cache both representations?

Integrating ID mapping into gene set retrieval is also possible; I'll add a gene.id.type argument to getGenesets for this.

GabrielHoffman commented 4 years ago

My issue is that I, as a developer, want to use GeneSetCollection. But if I write a package that calls getGenesets, and a user has already called this function and returned a list, then I can't get it to return the right type from cache.
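For a package that has to work against a version with this behaviour, one defensive pattern is to check the class of the returned object and retry without the cache on a mismatch. This is a sketch under stated assumptions, not a recommended API: get_with_type_check and toy_getter are hypothetical, and toy_getter only imitates the reported behaviour with a toy S3 class instead of the real GeneSetCollection class.

```r
# Hypothetical wrapper: `getter` stands in for a function like getGenesets()
# that accepts a `cache` argument. Retry without the cache if the cached
# object has the wrong class.
get_with_type_check <- function(getter, expected_class, ...) {
  res <- getter(..., cache = TRUE)
  if (!inherits(res, expected_class)) {
    res <- getter(..., cache = FALSE)  # bypass the stale cache entry
  }
  if (!inherits(res, expected_class)) {
    stop("getter did not return an object of class ", expected_class)
  }
  res
}

# Toy getter that reproduces the reported behaviour: the cache always
# holds a plain list, whatever return type is requested.
toy_getter <- function(return.type = "list", cache = TRUE) {
  if (cache) return(list(setA = c("1", "2")))
  structure(list(setA = c("1", "2")), class = return.type)
}

res <- get_with_type_check(toy_getter, "GeneSetCollection",
                           return.type = "GeneSetCollection")
```

The cost of this pattern is that a mismatch triggers a full, uncached retrieval, so it only makes sense as a fallback.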

lgeistlinger commented 4 years ago

Understood. The caching part (the title of this issue) is resolved in b54da66, corresponding to EnrichmentBrowser v2.19.10. getGenesets should now distinguish by return.type when retrieving from the cache. You can install directly from GitHub via BiocManager::install("lgeistlinger/EnrichmentBrowser").
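The general idea of distinguishing cache entries by return type can be sketched as follows. This is an illustrative toy, not the code from the commit mentioned above: cache_env, cache_key, and get_sets are hypothetical names, the retrieval step is a stand-in, and the "GeneSetCollection" tag is a plain S3 class rather than the real one.

```r
# Minimal in-memory cache keyed by every argument that affects the result.
cache_env <- new.env(parent = emptyenv())

cache_key <- function(org, db, return.type) {
  paste(org, db, return.type, sep = "::")
}

get_sets <- function(org, db, return.type = c("list", "GeneSetCollection")) {
  return.type <- match.arg(return.type)
  key <- cache_key(org, db, return.type)
  if (exists(key, envir = cache_env, inherits = FALSE)) {
    return(get(key, envir = cache_env))  # cache hit for this exact type
  }
  # Stand-in for the expensive retrieval step
  sets <- list(setA = c("1", "2"))
  if (return.type == "GeneSetCollection") {
    class(sets) <- "GeneSetCollection"  # toy S3 tag, not the real class
  }
  assign(key, sets, envir = cache_env)
  sets
}

as_list <- get_sets("hsa", "go")
as_gsc  <- get_sets("hsa", "go", return.type = "GeneSetCollection")
```

Because return.type is part of the key, the second call cannot pick up the list stored by the first one; each representation gets its own cache entry.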

I'll make a separate issue for the ID mapping part - that's some more restructuring.

lgeistlinger commented 4 years ago

> I've been thinking about statistical methods for geneset analyses, and I've found your work really useful. I'm currently working on making a new method more accessible to users, and I'd be happy to share once I get a little further.

Many thanks for the kind feedback. I am definitely interested once you have something you would like to share.