jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

multiple EnsDb for mouse @ AnnotationHub: which one to use? #147

Closed: guidohooiveld closed this issue 1 year ago

guidohooiveld commented 1 year ago

Hi Johannes, I would like to use version 109 of the EnsDb for mouse. However, in contrast to v108, I now notice that multiple EnsDbs (records) are available, which confuses me. So which one should I use, i.e., which one is analogous to the single EnsDb (record) present for v108?

Thanks, Guido

> library(AnnotationHub)
> library(ensembldb)
> library(AnnotationForge)
> 
> ah <- AnnotationHub()
snapshotDate(): 2023-04-06
> 
> ## query for v109 of the mouse EnsDb
> ## note that 16 EnsDbs (records) are available, of which 14 are for Mus musculus...
> query(ah, c("EnsDb", "v109", "Mus musculus"))
AnnotationHub with 16 records
# snapshotDate(): 2023-04-06
# $dataprovider: Ensembl
# $species: Mus musculus, Mus musculus musculus, Mus musculus domesticus, Mu...
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH109640"]]' 

             title                                        
  AH109640 | Ensembl 109 EnsDb for Mus musculus           
  AH109641 | Ensembl 109 EnsDb for Mus musculus           
  AH109642 | Ensembl 109 EnsDb for Mus musculus           
  AH109643 | Ensembl 109 EnsDb for Mus musculus           
  AH109644 | Ensembl 109 EnsDb for Mus musculus           
  ...        ...                                          
  AH109651 | Ensembl 109 EnsDb for Mus musculus           
  AH109652 | Ensembl 109 EnsDb for Mus musculus           
  AH109653 | Ensembl 109 EnsDb for Mus musculus musculus  
  AH109654 | Ensembl 109 EnsDb for Mus musculus domesticus
  AH109655 | Ensembl 109 EnsDb for Mus musculus           
> 
> ## this is in contrast with v108: then only a single EnsDb (record) is present... ??
> query(ah, c("EnsDb", "v108", "Mus musculus"))
AnnotationHub with 1 record
# snapshotDate(): 2023-04-06
# names(): AH109367
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2022-10-31
# $title: Ensembl 108 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm39
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("108", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH109367"]]' 
> 
>
> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] AnnotationForge_1.42.0  ensembldb_2.24.0        AnnotationFilter_1.24.0
 [4] GenomicFeatures_1.52.0  AnnotationDbi_1.62.1    Biobase_2.60.0         
 [7] GenomicRanges_1.52.0    GenomeInfoDb_1.36.0     IRanges_2.34.0         
[10] S4Vectors_0.38.1        AnnotationHub_3.8.0     BiocFileCache_2.8.0    
[13] dbplyr_2.3.2            BiocGenerics_0.46.0    

<<snip>>

guidohooiveld commented 1 year ago

Aha, I think I got it.

Apparently since v109 of Ensembl the genomes of 15 different mouse strains, in addition to the Genome Reference Consortium builds, are being annotated. https://www.ensembl.org/Mus_musculus/Info/Strains?db=core

Since I mapped my RNA-seq data to the latest GENCODE annotation and reference files (Release M32 (GRCm39)), I have to use the EnsDb of the reference strain, that is, the one with $genome: GRCm39. This turns out to be the 16th EnsDb (i.e. AH109655). In contrast, the first EnsDb (AH109640) is for mouse strain 129S1/SvImJ (because $genome: 129S1_SvImJ_v1), and so on.

Therefore, to refine my question: is it also possible to restrict the query by genome assembly (i.e., to filter on $genome)?

> query(ah, c("EnsDb", "v109", "Mus musculus"))[16]
AnnotationHub with 1 record
# snapshotDate(): 2023-04-06
# names(): AH109655
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2022-10-30
# $title: Ensembl 109 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm39
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("109", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH109655"]]' 
> 
> query(ah, c("EnsDb", "v109", "Mus musculus"))[1]
AnnotationHub with 1 record
# snapshotDate(): 2023-04-06
# names(): AH109640
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2022-10-30
# $title: Ensembl 109 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: 129S1_SvImJ_v1
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("109", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH109640"]]' 
> 

jorainer commented 1 year ago

Hi Guido!

Yes, sorry, I now also create EnsDbs for all strains (there have been requests for that); up to now I simply dropped them. Regarding the query: this is actually a function from AnnotationHub, not ensembldb, so I don't have any control over it. But maybe that would be a nice issue/feature request for AnnotationHub itself?

Note also that by adding the genome to the query call you should get what you want (I suppose?):

> query(ah, c("EnsDb", "v109", "Mus musculus", "GRCm39"))
AnnotationHub with 1 record
# snapshotDate(): 2023-05-15
# names(): AH109655
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2022-10-30
# $title: Ensembl 109 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm39
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("109", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH109655"]]' 

I don't know exactly how query works, but I assume it's combining the search terms with a logical AND (&) and searching across all metadata fields.
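
To make that assumption concrete, here is a minimal base R sketch of this kind of matching: a record is kept only if every search term is found in at least one of its fields. It is a toy model and not AnnotationHub's implementation; toy_query and the small meta table (filled with three records shown in this thread) are hypothetical names used only for illustration.

```r
## Toy metadata table; row names are the record IDs shown in this thread.
meta <- data.frame(
    title  = c("Ensembl 108 EnsDb for Mus musculus",
               "Ensembl 109 EnsDb for Mus musculus",
               "Ensembl 109 EnsDb for Mus musculus"),
    genome = c("GRCm39", "129S1_SvImJ_v1", "GRCm39"),
    row.names = c("AH109367", "AH109640", "AH109655"),
    stringsAsFactors = FALSE
)

## Hypothetical helper: keep rows for which ALL terms match, where a term
## matches a row if ANY column contains it (case-insensitive).
toy_query <- function(meta, terms) {
    keep <- rep(TRUE, nrow(meta))
    for (term in terms) {
        hit_any <- Reduce(`|`, lapply(meta, function(col)
            grepl(term, as.character(col), ignore.case = TRUE)))
        keep <- keep & hit_any
    }
    meta[keep, , drop = FALSE]
}

## Without the genome term both v109 records match ...
rownames(toy_query(meta, c("Ensembl 109", "Mus musculus")))
## [1] "AH109640" "AH109655"

## ... adding "GRCm39" leaves only the reference assembly.
rownames(toy_query(meta, c("Ensembl 109", "Mus musculus", "GRCm39")))
## [1] "AH109655"
```

Under this model, each extra term can only narrow the result, which matches what adding "GRCm39" did in the query above.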

guidohooiveld commented 1 year ago

Yep, including the search term "GRCm39" indeed made it easy to find that specific EnsDb. Why did I not think of that myself... Thanks Jo!
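
For the refined question above (filtering on $genome rather than relying on a free-text term), the following sketch may also work. It assumes that the hub's metadata columns can be accessed with $ and that the hub can be subset with a logical vector; the variable names mm109, ref and edb are only illustrative, and the code needs network access to AnnotationHub.

```r
library(AnnotationHub)

ah <- AnnotationHub()

## All v109 mouse EnsDb records (reference assembly plus strains).
mm109 <- query(ah, c("EnsDb", "v109", "Mus musculus"))

## Overview of the assemblies behind these records.
table(mm109$genome)

## Keep only the records whose genome column is exactly "GRCm39".
ref <- mm109[mm109$genome == "GRCm39"]
ref

## Retrieve the EnsDb by its ID, as suggested in the printed record.
edb <- ref[["AH109655"]]
```

Compared with adding "GRCm39" as a search term, this compares the genome column for equality, so it would not pick up records that merely mention the assembly name in another field.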