jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

Mitochondrial chromosome is circular but isCircular(edb)["MT"] value is set as FALSE in the ensembldb object #133

Closed manogenome closed 2 years ago

manogenome commented 2 years ago

Hi Johannes,

The mitochondrial chromosome is circular, but it's tagged as linear in the ensembldb object while retrieving seqinfo. This is causing errors when retrieving sequences from the BSgenome object. Also, I notice that sequence style conversion between NCBI and UCSC formats doesn't work for the edb seqinfo object. Below, I've document the issue for your reference.

Load packages

library(ensembldb)
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: AnnotationFilter
#> 
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(AnnotationHub)
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
#> 
#> Attaching package: 'AnnotationHub'
#> The following object is masked from 'package:Biobase':
#> 
#>     cache
library(BSgenome.Hsapiens.NCBI.GRCh38)
#> Loading required package: BSgenome
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: rtracklayer
#> 
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:AnnotationHub':
#> 
#>     hubUrl
grch38 <- BSgenome.Hsapiens.NCBI.GRCh38
ah <- AnnotationHub()
#> snapshotDate(): 2021-10-20
qr <- query(ah, c("EnsDb", "sapiens", "104"))
edb <- qr[[1]]
#> loading from cache
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.7
#> |Creation time: Sat Oct 30 20:42:32 2021
#> |ensembl_version: 104
#> |ensembl_host: localhost
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.2
#> | No. of genes: 67990.
#> | No. of transcripts: 259749.
#> |Protein data available.

Get seqinfo

seqinfo_edb <- seqinfo(edb)
seqinfo_h38 <- seqinfo(grch38)

Check seqlevel styles

For both edb and grch38, the sequence level style is NCBI.

seqlevelsStyle(seqinfo_edb)
#> [1] "NCBI"
seqlevelsStyle(seqinfo_h38)
#> [1] "NCBI"

Mitochondrial chromosome

For the mitochondria, isCircular value is set as FALSE in edb and TRUE in grch38. This needs to fixed in edb as we know the mitochondrial DNA is circular. This creates an issue when we have a GRanges object from edb and want to extract corresponding sequence from GRCh38.

seqinfo_edb["MT"]
#> Seqinfo object with 1 sequence from GRCh38 genome:
#>   seqnames seqlengths isCircular genome
#>   MT            16569      FALSE GRCh38
seqinfo_h38["MT"]
#> Seqinfo object with 1 sequence (1 circular) from GRCh38 genome:
#>   seqnames seqlengths isCircular genome
#>   MT            16569       TRUE GRCh38

Seqnames overlap

Even though both the objects are in NCBI format, chromosome names overlap doesn't extend to contigs.

seqnames_edb <- seqnames(seqinfo_edb)
seqnames_h38 <- seqnames(seqinfo_h38)

length(seqnames_edb)
#> [1] 455
length(seqnames_h38)
#> [1] 455

intersect(seqnames_edb, seqnames_h38)
#>  [1] "X"  "20" "1"  "6"  "3"  "7"  "12" "11" "4"  "17" "2"  "16" "8"  "19" "9" 
#> [16] "13" "14" "5"  "22" "10" "Y"  "18" "15" "21" "MT"

Switching seq style

For the edb, we note that switching sequence style to UCSC does not work for all the sequences. But for GRCh38, sequence style conversion between NCBI and UCSC works without any issue.

seqlevelsStyle(seqinfo_edb) <- "UCSC"
#> Warning in (function (seqlevels, genome, new_style) : cannot switch some of
#> GRCh38's seqlevels from NCBI to UCSC style
seqlevelsStyle(seqinfo_h38) <- "UCSC"

#seqnames(seqinfo_edb)
#seqnames(seqinfo_h38)

seqlevelsStyle(seqinfo_edb) <- "NCBI"
seqlevelsStyle(seqinfo_h38) <- "NCBI"

Created on 2022-04-13 by the reprex package (v2.0.1)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.2 (2021-11-01) #> os Ubuntu 18.04.6 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Brussels #> date 2022-04-13 #> pandoc 2.11.4 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> AnnotationDbi * 1.56.2 2021-11-09 [1] Bioconductor #> AnnotationFilter * 1.18.0 2021-10-26 [1] Bioconductor #> AnnotationHub * 3.2.2 2022-03-01 [1] Bioconductor #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1) #> Biobase * 2.54.0 2021-10-26 [1] Bioconductor #> BiocFileCache * 2.2.1 2022-01-23 [1] Bioconductor #> BiocGenerics * 0.40.0 2021-10-26 [1] Bioconductor #> BiocIO 1.4.0 2021-10-26 [1] Bioconductor #> BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.1) #> BiocParallel 1.28.3 2021-12-09 [1] Bioconductor #> BiocVersion 3.14.0 2021-05-19 [1] Bioconductor #> biomaRt 2.50.3 2022-02-03 [1] Bioconductor #> Biostrings * 2.62.0 2021-10-26 [1] Bioconductor #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.1) #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1) #> blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.1) #> BSgenome * 1.62.0 2021-10-26 [1] Bioconductor #> BSgenome.Hsapiens.NCBI.GRCh38 * 1.3.1000 2022-03-23 [1] Bioconductor #> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1) #> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2) #> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.2) #> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.1) #> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.1) #> dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.1) #> DelayedArray 0.20.0 2021-10-26 [1] Bioconductor #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1) #> dplyr 1.0.8 2022-02-08 [1] CRAN (R 4.1.2) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.1) #> ensembldb * 2.19.8 2022-02-04 [1] Github (jorainer/ensembldb@423b35d) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2) #> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1) #> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.1) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1) #> generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2) #> GenomeInfoDb * 1.30.1 2022-01-30 [1] Bioconductor #> GenomeInfoDbData 1.2.7 2021-09-28 [1] Bioconductor #> GenomicAlignments 1.30.0 2021-10-26 [1] Bioconductor #> GenomicFeatures * 1.46.5 2022-02-27 [1] Bioconductor #> GenomicRanges * 1.46.1 2021-11-18 [1] Bioconductor #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2) #> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1) #> httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.1.1) #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.1) #> interactiveDisplayBase 1.32.0 2021-10-26 [1] Bioconductor #> IRanges * 2.28.0 2021-10-26 [1] Bioconductor #> KEGGREST 1.34.0 2021-10-26 [1] Bioconductor #> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.1) #> later 1.3.0 2021-08-18 [1] CRAN (R 4.1.1) #> lattice 0.20-45 2021-09-22 [4] CRAN (R 4.1.1) #> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.1) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1) #> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2) #> Matrix 1.4-0 2021-12-08 [4] CRAN (R 4.1.2) #> MatrixGenerics 1.6.0 2021-10-26 [1] Bioconductor #> matrixStats 0.61.0 2021-09-17 [1] CRAN (R 4.1.1) #> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1) #> mime 0.12 2021-09-28 [1] CRAN (R 4.1.1) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1) #> png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.1) #> progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.1) #> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.1) #> ProtGenerics 1.27.2 2022-01-12 [1] Github (RforMassSpectrometry/ProtGenerics@10880dd) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1) #> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.1) #> Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.2) #> RCurl 1.98-1.6 2022-02-08 [1] CRAN (R 4.1.2) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1) #> restfulr 0.0.13 2017-08-06 [1] CRAN (R 4.1.1) #> rjson 0.2.21 2022-01-09 [1] CRAN (R 4.1.1) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.2) #> rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.2) #> Rsamtools 2.10.0 2021-10-26 [1] Bioconductor #> RSQLite 2.2.11 2022-03-23 [1] CRAN (R 4.1.2) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1) #> rtracklayer * 1.54.0 2021-10-26 [1] Bioconductor #> S4Vectors * 0.32.3 2021-11-21 [1] Bioconductor #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1) #> shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.1) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1) #> SummarizedExperiment 1.24.0 2021-10-26 [1] Bioconductor #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.1) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2) #> xfun 0.30 2022-03-02 [1] CRAN (R 4.1.2) #> XML 3.99-0.9 2022-02-24 [1] CRAN (R 4.1.2) #> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1) #> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.1) #> XVector * 0.34.0 2021-10-26 [1] Bioconductor #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2) #> zlibbioc 1.40.0 2021-10-26 [1] Bioconductor #> #> [1] /home/manoj/R/x86_64-pc-linux-gnu-library/4.1 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
jorainer commented 2 years ago

Thanks - I will have a look into this.

For the circular chromosomes - I did not yet find a way to extract this information from the Ensembl Perl API, thus, in all EnsDb all chromosomes have isCircular = FALSE. I'll have another look into that - maybe I find this information somewhere.

jorainer commented 2 years ago

Ah, actually - I'm fetching this information using the Perl API (see also issue #86). I'll next have a look at the problems in changing seqlevel styles...

jorainer commented 2 years ago

Regarding the mapping of contigs, patches etc - there's not much that can be done unfortunately. AFAIK Ensembl has their own pipeline to define and build these sequences and this will most likely result in results that are not directly comparable/mappable to those of UCSC or NCBI. The chromosomes are defined and should be mappable across databases - but for the rest I would not guarantee that this is possible at all.

Note that the mapping of the seqlevels (chromosomes) is based on the GenomeInfoDb Bioconductor package and in fact this package provides only the mapping for the core chromosomes (see here).

So, summarizing, I don't think we can provide a complete mapping between annotation resources. IMHO it is also not wise to do so - for one analysis it's better to stay within one annotation ecosystem (USCS, NCBI or Ensembl) but not to switch.

manogenome commented 2 years ago

Thank you, Johannes. Indeed, I agree with you regarding the seqlevelsStyle. For now, I've limited my analysis to standard chromosomes so that I can switch seqlevelsStyle for visualization. As for the chromosome circularity, I did a workaround by setting the value during the analysis as isCircular(edb)["MT"] <- TRUE during analysis. I'm wondering if it's possible to set the tag value for the Mictochnodia and Chloroplast as TRUE for eukaryotic genomes without having to depend on the API.

jorainer commented 2 years ago

I'll look into that. My worry is that the mitochondrial chromosome does not always have the same name in all species (not sure about that though).

jorainer commented 2 years ago

I can add a quick fix like your's, i.e. that manually sets isCircular = TRUE for certain chromosomes. At present I'll check for chromosomes named "MT". Fix will be added after BioC 3.15 is released (tomorrow).

jorainer commented 2 years ago

The fix is now commited to BioC 3.15 as well as BioC devel. Might take some days until the updated packages are available through BiocManager::install("ensembldb").

Closing the issue now - feel free to re-open if needed.

manogenome commented 2 years ago

Thanks, Johannes. I'll check the updated packages.