jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

genomeToTranscript not mapping certain ranges (including entire chromosomes) #148

Closed cjagla closed 10 months ago

cjagla commented 10 months ago

Hi, I'm trying to use genomeToTranscript() to convert exon coordinates from genomic to transcript-based for specific human genes. I can't get any examples of genomic coordinates to work other than the one you've provided in the function manual, including a very large chunk of chromosome 6 that should definitely contain many genes or the entirety of chromosome 19. Can you please assist?

`> a[["9606"]] EnsDb for Ensembl: |Backend: SQLite |Db type: EnsDb |Type of Gene ID: Ensembl Gene ID |Supporting package: ensembldb |Db created by: ensembldb package from Bioconductor |script_version: 0.3.10 |Creation time: Thu Feb 16 12:36:05 2023 |ensembl_version: 109 |ensembl_host: localhost |Organism: Homo sapiens |taxonomy_id: 9606 |genome_build: GRCh38 |DBSCHEMAVERSION: 2.2 |common_name: human |species: homo_sapiens | No. of genes: 70623. | No. of transcripts: 276218. |Protein data available.

gnm <- GRanges("6:88109864-88166347") gnm_tx <- genomeToTranscript(gnm, a[["9606"]]) Warning message: 1 genomic region(s) could not be mapped to a transcript; hint: see ?seqlevelsStyle if you used UCSC chromosome names gnm_tx IRangesList object of length 1: [[1]] IRanges object with 1 range and 7 metadata columns: start end width | tx_id exon_id exon_rank seq_start seq_end seq_name seq_strand

| [1] -1 -1 1 | 88109864 88166347 6 * gnm <- GRanges("6:1-88167000") gnm_tx <- genomeToTranscript(gnm, a[["9606"]]) Warning message: 1 genomic region(s) could not be mapped to a transcript; hint: see ?seqlevelsStyle if you used UCSC chromosome names gnm_tx IRangesList object of length 1: [[1]] IRanges object with 1 range and 7 metadata columns: start end width | tx_id exon_id exon_rank seq_start seq_end seq_name seq_strand | [1] -1 -1 1 | 1 88167000 6 * gnm <- GRanges("X:107716399-107716401") gnm_tx <- genomeToTranscript(gnm, a[["9606"]]) gnm_tx IRangesList object of length 1: [[1]] IRanges object with 2 ranges and 7 metadata columns: start end width | tx_id exon_id exon_rank seq_start seq_end seq_name seq_strand | ENST00000372390 145 147 3 | ENST00000372390 ENSE00001457675 1 107716399 107716401 X * ENST00000486554 1 3 3 | ENST00000486554 ENSE00001927337 1 107716399 107716401 X * sessionInfo() R version 4.2.2 (2022-10-31) Platform: x86_64-pc-linux-gnu (64-bit) Running under: AlmaLinux 9.2 (Turquoise Kodkod)

Matrix products: default BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.21.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] ensembldb_2.22.0 AnnotationFilter_1.22.0 GenomicFeatures_1.50.4 AnnotationDbi_1.60.2 Biobase_2.58.0 GenomicRanges_1.50.2
[7] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2 AnnotationHub_3.6.0 BiocFileCache_2.6.1 dbplyr_2.3.3
[13] BiocGenerics_0.44.0 stringr_1.5.0 dplyr_1.1.2 jsonlite_1.8.7 readr_2.1.4 tidyr_1.3.0
[19] xml2_1.3.5 biomaRt_2.54.1

loaded via a namespace (and not attached): [1] ProtGenerics_1.30.0 bitops_1.0-7 matrixStats_1.0.0 bit64_4.0.5 filelock_1.0.2
[6] progress_1.2.2 httr_1.4.7 tools_4.2.2 utf8_1.2.3 R6_2.5.1
[11] DBI_1.1.3 lazyeval_0.2.2 withr_2.5.0 tidyselect_1.2.0 prettyunits_1.1.1
[16] bit_4.0.5 curl_5.0.2 compiler_4.2.2 cli_3.6.1 DelayedArray_0.24.0
[21] rtracklayer_1.58.0 rappdirs_0.3.3 digest_0.6.33 Rsamtools_2.14.0 XVector_0.38.0
[26] pkgconfig_2.0.3 htmltools_0.5.6 MatrixGenerics_1.10.0 fastmap_1.1.1 rlang_1.1.1
[31] RSQLite_2.3.1 shiny_1.7.5 BiocIO_1.8.0 generics_0.1.3 BiocParallel_1.32.6
[36] RCurl_1.98-1.12 magrittr_2.0.3 GenomeInfoDbData_1.2.9 Matrix_1.5-1 Rcpp_1.0.11
[41] fansi_1.0.4 lifecycle_1.0.3 stringi_1.7.12 yaml_2.3.7 SummarizedExperiment_1.28.0
[46] zlibbioc_1.44.0 grid_4.2.2 blob_1.2.4 parallel_4.2.2 promises_1.2.1
[51] crayon_1.5.2 lattice_0.20-45 Biostrings_2.66.0 hms_1.1.3 KEGGREST_1.38.0
[56] pillar_1.9.0 rjson_0.2.21 codetools_0.2-18 XML_3.99-0.14 glue_1.6.2
[61] BiocVersion_3.16.0 renv_1.0.2 BiocManager_1.30.22 png_0.1-8 vctrs_0.6.3
[66] tzdb_0.4.0 httpuv_1.6.11 purrr_1.0.2 cachem_1.0.8 mime_0.12
[71] xtable_1.8-4 restfulr_0.0.15 later_1.3.1 tibble_3.2.1 GenomicAlignments_1.34.1
[76] memoise_2.0.1 ellipsis_0.3.2 interactiveDisplayBase_1.36.0`

jorainer commented 10 months ago

The example code works because the full genomic region is within the transcript (i.e. one of its exons): see also the respective region in the Ensembl genome browser here.

Your region defines a large genomic region that contains a full transcript and also downstream sequence (see here). The functions to map between genomic and transcript or protein coordinates (genomeToTranscript genomeToProtein) can only map genomic regions that are within an exon of a transcript to the respective coordinates within that transcript (i.e. convert between genomic coordinates and transcript-relative coordinates).

So, these functions can only be used to convert between these coordinate systems. Genomic coordinates that do not code for an exon of a transcript can not be converted to transcript coordinates.

If instead you were looking for a way to identify transcripts that are located in a certain genomic region:

> gnm <- GRanges("6:88109864-88166347")
> transcripts(edb, filter = GRangesFilter(gnm))
GRanges object with 7 ranges and 11 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000362094        6 88139864-88145274      - | ENST00000362094
  ENST00000468898        6 88139864-88145274      - | ENST00000468898
              ...      ...               ...    ... .             ...
  ENST00000369501        6 88139864-88166347      - | ENST00000369501
  ENST00000551417        6 88139864-88166347      - | ENST00000551417
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000362094 protein_coding         88144997       88145274
  ENST00000468898 protein_coding         88143856       88145274
              ...            ...              ...            ...
  ENST00000369501 protein_coding         88143856       88145274
  ENST00000551417 protein_coding         88143856       88145274
                          gene_id tx_support_level     tx_id_version gc_content
                      <character>        <integer>       <character>  <numeric>
  ENST00000362094 ENSG00000118432                1 ENST00000362094.6    40.5797
  ENST00000468898 ENSG00000118432                1 ENST00000468898.2    40.7568
              ...             ...              ...               ...        ...
  ENST00000369501 ENSG00000118432                1 ENST00000369501.3    43.8611
  ENST00000551417 ENSG00000118432                5 ENST00000551417.2    44.0806
                  tx_external_name tx_is_canonical         tx_name
                       <character>       <integer>     <character>
  ENST00000362094         CNR1-201               0 ENST00000362094
  ENST00000468898         CNR1-205               0 ENST00000468898
              ...              ...             ...             ...
  ENST00000369501         CNR1-203               1 ENST00000369501
  ENST00000551417         CNR1-207               0 ENST00000551417
  -------
  seqinfo: 1 sequence from GRCh38 genome
cjagla commented 10 months ago

Okay, thanks for the explanation!