jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

transcriptToGenome problem #85

Closed smaegol closed 5 years ago

smaegol commented 5 years ago

Hi,

I've the problem with the transcriptToGenome function, which fails in the case of some transcripts. The error occurs for example for the transcript ENST00000357876. Function is invoked like:

ensdb <- EnsDb.Hsapiens.v86
rng_tx <- IRanges(start = 1, width = 1702,
                  names = "ENST00000357876")

rng_gnm <- transcriptToGenome(rng_tx, ensdb) 

The error is

Error in .to_genome(exns[[names(z)]], z) : 
  The within transcript/CDS coordinates are outside the region defined by the provided exons

The problem is that in the case of multiple transcripts processed at once it breaks the whole pipeline. Maybe the situation where transcript coordinates do not cover exactly annotated exons should not lead to error but rather to warning? Or mapping should be done anyway?

ensembldb_2.4.1 (installed from Bioconductor) R version 3.5.0

jorainer commented 5 years ago

@smaegol , thanks for the feedback! I'll have a closer look at it.

jorainer commented 5 years ago

One comment so far: one possibility would also be to check and fix the IRanges beforehand:

rngs <- IRanges(start = c(2, 1, 1), width = c(40, 1702, 90),
                names = c("ENST00000435945", "ENST00000357876",
                          "ENST00000456328"))

transcriptToGenome(rngs, ensdb)
Error in .to_genome(exns[[names(z)]], z) : 
  The within transcript/CDS coordinates are outside the region defined by the provided exons

lens <- transcriptLengths(ensdb, filter = ~ tx_id == names(rngs))

rngs_fix <- rngs
rng_out <- end(rngs_fix) > lens[names(rngs_fix), "tx_len"]
end(rngs_fix)[rng_out] <- lens[rng_out, "tx_len"]

transcriptToGenome(rngs_fix, ensdb)

GRangesList object of length 3:
$ENST00000435945 
GRanges object with 1 range and 5 metadata columns:
      seqnames            ranges strand |         exon_id           tx_id
         <Rle>         <IRanges>  <Rle> |     <character>     <character>
  [1]        Y 26634612-26634651      - | ENSE00001797328 ENST00000435945
      exon_rank  tx_start    tx_end
      <integer> <integer> <integer>
  [1]         1         2        41

$ENST00000357876 
GRanges object with 2 ranges and 5 metadata columns:
      seqnames        ranges strand |         exon_id           tx_id exon_rank
  [1]        1 597299-597498      - | ENSE00001651821 ENST00000357876         1
  [2]        1 594459-594756      - | ENSE00001652759 ENST00000357876         2
      tx_start tx_end
  [1]        1    498
  [2]        1    498

$ENST00000456328 
GRanges object with 1 range and 5 metadata columns:
      seqnames      ranges strand |         exon_id           tx_id exon_rank
  [1]        1 11869-11958      + | ENSE00002234944 ENST00000456328         1
      tx_start tx_end
  [1]        1     90

-------
seqinfo: 2 sequences from GRCh38 genome
jorainer commented 5 years ago

Dear @smaegol, I've fixed transcriptToGenome to not throw an error when the input coordinates are outside of the transcript's sequence:

> library(ensembldb)
> library(EnsDb.Hsapiens.v86)
> ensdb <- EnsDb.Hsapiens.v86
> rng_tx <- IRanges(start = 1, width = 1702,
+                   names = "ENST00000357876")
>
> rng_gnm <- transcriptToGenome(rng_tx, ensdb) 
Warning messages:
1: The within transcript/CDS coordinates are outside the region defined by the provided exons 
2: In transcriptToGenome(rng_tx, ensdb) :
  1 transcript(s) could either not be found in the database or the specified range is outside the transcript's sequence
>
> rng_gnm
GRangesList object of length 1:
$ENST00000357876 
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>

-------
seqinfo: no sequences
>
> lengths(rng_gnm)
ENST00000357876 
              0 

This fix is included in ensembldb version >= 2.6.1. You can either install that from github (devtools::install_github("jotsetung/ensembldb", ref = "RELEASE_3_8")) or wait 1 or 2 days until you get it as an update via Bioconductor (version 3.8).

jorainer commented 5 years ago

Closing because it has been fixed. Feel free to re-open if needed.