Closed manogenome closed 2 years ago
Thanks - I will have a look into this.
For the circular chromosomes - I did not yet find a way to extract this information from the Ensembl Perl API, thus, in all EnsDb
all chromosomes have isCircular = FALSE
. I'll have another look into that - maybe I find this information somewhere.
Ah, actually - I'm fetching this information using the Perl API (see also issue #86). I'll next have a look at the problems in changing seqlevel styles...
Regarding the mapping of contigs, patches etc - there's not much that can be done unfortunately. AFAIK Ensembl has their own pipeline to define and build these sequences and this will most likely result in results that are not directly comparable/mappable to those of UCSC or NCBI. The chromosomes are defined and should be mappable across databases - but for the rest I would not guarantee that this is possible at all.
Note that the mapping of the seqlevels (chromosomes) is based on the GenomeInfoDb
Bioconductor package and in fact this package provides only the mapping for the core chromosomes (see here).
So, summarizing, I don't think we can provide a complete mapping between annotation resources. IMHO it is also not wise to do so - for one analysis it's better to stay within one annotation ecosystem (USCS, NCBI or Ensembl) but not to switch.
Thank you, Johannes. Indeed, I agree with you regarding the seqlevelsStyle
. For now, I've limited my analysis to standard chromosomes so that I can switch seqlevelsStyle for visualization. As for the chromosome circularity, I did a workaround by setting the value during the analysis as isCircular(edb)["MT"] <- TRUE
during analysis. I'm wondering if it's possible to set the tag value for the Mictochnodia and Chloroplast as TRUE
for eukaryotic genomes without having to depend on the API.
I'll look into that. My worry is that the mitochondrial chromosome does not always have the same name in all species (not sure about that though).
I can add a quick fix like your's, i.e. that manually sets isCircular = TRUE
for certain chromosomes. At present I'll check for chromosomes named "MT"
. Fix will be added after BioC 3.15 is released (tomorrow).
The fix is now commited to BioC 3.15 as well as BioC devel. Might take some days until the updated packages are available through BiocManager::install("ensembldb")
.
Closing the issue now - feel free to re-open if needed.
Thanks, Johannes. I'll check the updated packages.
Hi Johannes,
The mitochondrial chromosome is circular, but it's tagged as linear in the ensembldb object while retrieving
seqinfo
. This is causing errors when retrieving sequences from the BSgenome object. Also, I notice that sequence style conversion between NCBI and UCSC formats doesn't work for the edb seqinfo object. Below, I've document the issue for your reference.Load packages
Get seqinfo
Check seqlevel styles
For both edb and grch38, the sequence level style is NCBI.
Mitochondrial chromosome
For the mitochondria,
isCircular
value is set asFALSE
in edb andTRUE
in grch38. This needs to fixed in edb as we know the mitochondrial DNA is circular. This creates an issue when we have a GRanges object from edb and want to extract corresponding sequence from GRCh38.Seqnames overlap
Even though both the objects are in NCBI format, chromosome names overlap doesn't extend to contigs.
Switching seq style
For the edb, we note that switching sequence style to UCSC does not work for all the sequences. But for GRCh38, sequence style conversion between NCBI and UCSC works without any issue.
Created on 2022-04-13 by the reprex package (v2.0.1)
Session info
``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.2 (2021-11-01) #> os Ubuntu 18.04.6 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Brussels #> date 2022-04-13 #> pandoc 2.11.4 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> AnnotationDbi * 1.56.2 2021-11-09 [1] Bioconductor #> AnnotationFilter * 1.18.0 2021-10-26 [1] Bioconductor #> AnnotationHub * 3.2.2 2022-03-01 [1] Bioconductor #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1) #> Biobase * 2.54.0 2021-10-26 [1] Bioconductor #> BiocFileCache * 2.2.1 2022-01-23 [1] Bioconductor #> BiocGenerics * 0.40.0 2021-10-26 [1] Bioconductor #> BiocIO 1.4.0 2021-10-26 [1] Bioconductor #> BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.1) #> BiocParallel 1.28.3 2021-12-09 [1] Bioconductor #> BiocVersion 3.14.0 2021-05-19 [1] Bioconductor #> biomaRt 2.50.3 2022-02-03 [1] Bioconductor #> Biostrings * 2.62.0 2021-10-26 [1] Bioconductor #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.1) #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1) #> blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.1) #> BSgenome * 1.62.0 2021-10-26 [1] Bioconductor #> BSgenome.Hsapiens.NCBI.GRCh38 * 1.3.1000 2022-03-23 [1] Bioconductor #> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1) #> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2) #> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.2) #> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.1) #> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.1) #> dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.1) #> DelayedArray 0.20.0 2021-10-26 [1] Bioconductor #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1) #> dplyr 1.0.8 2022-02-08 [1] CRAN (R 4.1.2) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.1) #> ensembldb * 2.19.8 2022-02-04 [1] Github (jorainer/ensembldb@423b35d) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2) #> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1) #> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.1) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1) #> generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2) #> GenomeInfoDb * 1.30.1 2022-01-30 [1] Bioconductor #> GenomeInfoDbData 1.2.7 2021-09-28 [1] Bioconductor #> GenomicAlignments 1.30.0 2021-10-26 [1] Bioconductor #> GenomicFeatures * 1.46.5 2022-02-27 [1] Bioconductor #> GenomicRanges * 1.46.1 2021-11-18 [1] Bioconductor #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2) #> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1) #> httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.1.1) #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.1) #> interactiveDisplayBase 1.32.0 2021-10-26 [1] Bioconductor #> IRanges * 2.28.0 2021-10-26 [1] Bioconductor #> KEGGREST 1.34.0 2021-10-26 [1] Bioconductor #> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.1) #> later 1.3.0 2021-08-18 [1] CRAN (R 4.1.1) #> lattice 0.20-45 2021-09-22 [4] CRAN (R 4.1.1) #> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.1) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1) #> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2) #> Matrix 1.4-0 2021-12-08 [4] CRAN (R 4.1.2) #> MatrixGenerics 1.6.0 2021-10-26 [1] Bioconductor #> matrixStats 0.61.0 2021-09-17 [1] CRAN (R 4.1.1) #> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1) #> mime 0.12 2021-09-28 [1] CRAN (R 4.1.1) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1) #> png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.1) #> progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.1) #> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.1) #> ProtGenerics 1.27.2 2022-01-12 [1] Github (RforMassSpectrometry/ProtGenerics@10880dd) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1) #> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.1) #> Rcpp 1.0.8.3 2022-03-17 [1] CRAN (R 4.1.2) #> RCurl 1.98-1.6 2022-02-08 [1] CRAN (R 4.1.2) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1) #> restfulr 0.0.13 2017-08-06 [1] CRAN (R 4.1.1) #> rjson 0.2.21 2022-01-09 [1] CRAN (R 4.1.1) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.2) #> rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.2) #> Rsamtools 2.10.0 2021-10-26 [1] Bioconductor #> RSQLite 2.2.11 2022-03-23 [1] CRAN (R 4.1.2) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1) #> rtracklayer * 1.54.0 2021-10-26 [1] Bioconductor #> S4Vectors * 0.32.3 2021-11-21 [1] Bioconductor #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1) #> shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.1) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1) #> SummarizedExperiment 1.24.0 2021-10-26 [1] Bioconductor #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.1) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2) #> xfun 0.30 2022-03-02 [1] CRAN (R 4.1.2) #> XML 3.99-0.9 2022-02-24 [1] CRAN (R 4.1.2) #> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1) #> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.1) #> XVector * 0.34.0 2021-10-26 [1] Bioconductor #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2) #> zlibbioc 1.40.0 2021-10-26 [1] Bioconductor #> #> [1] /home/manoj/R/x86_64-pc-linux-gnu-library/4.1 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```