jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, : subscript is out of bounds #126

Closed lgatto closed 2 years ago

lgatto commented 2 years ago

I get the following error where start/end/names indicate start/end ranges of peptides along UnitProt identifiers (the same as discussed over slack yesterday) and edb is

> library(AnnotationHub)
> library(ensembldb)
> ah <- AnnotationHub()
snapshotDate(): 2021-12-20
> edb <- query(ah, c("EnsDB", "sapiens", "94"))[[1]]
loading from cache
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Sun Dec  9 02:42:14 2018
|ensembl_version: 94
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.1
| No. of genes: 65687.
| No. of transcripts: 228432.
|Protein data available.
+ x2 <- ensembldb::proteinToGenome(IRanges(start = x$start[i],
+                                          end = x$end[i],
+                                          names = x$names[i]),
+                                  edb,
+                                  idType = "uniprot_id")
+ 
Fetching CDS for 10000 proteins ... 9371 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(IRanges(start = x$start[i], end = x$end[i],  :
  No CDS found for: A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3, A9Z1Z3 [... truncated]
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
4: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
5: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 

Indeed, I confirm the first warning:

> ensembldb::transcripts(edb, filter = AnnotationFilter::UniprotFilter("A9Z1Z3"))

GRanges object with 0 ranges and 9 metadata columns:
   seqnames    ranges strand |       tx_id  tx_biotype tx_cds_seq_start
      <Rle> <IRanges>  <Rle> | <character> <character>        <integer>
   tx_cds_seq_end     gene_id tx_support_level tx_id_version     tx_name
        <integer> <character>        <integer>   <character> <character>
    uniprot_id
   <character>
  -------
  seqinfo: no sequences
> i <- 1569 ## extract the first range along A9Z1Z3
> ir <- IRanges(start = x$start[i], end = x$end[i], names = x$names[i])
> ir
IRanges object with 1 range and 0 metadata columns:
             start       end     width
         <integer> <integer> <integer>
  A9Z1Z3         1         6         6
> ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$A9Z1Z3
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning message:
In ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id") :
  No CDS found for: A9Z1Z3

Could you help with the error, @jorainer

lgatto commented 2 years ago

The error is due to Q1MSJ5. There exist transcripts for this UniProt identifier:

> ensembldb::transcripts(edb, filter = AnnotationFilter::UniprotFilter("Q1MSJ5"))
GRanges object with 2 ranges and 9 metadata columns:
                  seqnames            ranges strand |           tx_id
                     <Rle>         <IRanges>  <Rle> |     <character>
  ENST00000262210        8 67064368-67196263      + | ENST00000262210
  ENST00000519668        8 67083848-67195611      + | ENST00000519668
                      tx_biotype tx_cds_seq_start tx_cds_seq_end
                     <character>        <integer>      <integer>
  ENST00000262210 protein_coding         67064399       67195593
  ENST00000519668 protein_coding         67095665       67195593
                          gene_id tx_support_level     tx_id_version
                      <character>        <integer>       <character>
  ENST00000262210 ENSG00000104218                1 ENST00000262210.9
  ENST00000519668 ENSG00000104218                1 ENST00000519668.1
                          tx_name  uniprot_id
                      <character> <character>
  ENST00000262210 ENST00000262210      Q1MSJ5
  ENST00000519668 ENST00000519668      Q1MSJ5
  -------
  seqinfo: 1 sequence from GRCh38 genome

But some of the ranges must be out of bounds:

> i <- which(x$names == "Q1MSJ5")
> ir <- IRanges(start = x$start[i], end = x$end[i], names = x$names[i])
> ir
IRanges object with 381 ranges and 0 metadata columns:
             start       end     width
         <integer> <integer> <integer>
  Q1MSJ5         1        15        15
  Q1MSJ5        16        26        11
  Q1MSJ5        31        36         6
  Q1MSJ5        37        49        13
  Q1MSJ5        57        69        13
     ...       ...       ...       ...
  Q1MSJ5      1137      1168        32
  Q1MSJ5      1156      1172        17
  Q1MSJ5      1167      1174         8
  Q1MSJ5      1169      1175         7
  Q1MSJ5      1173      1200        28
> ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id")
Fetching CDS for 381 proteins ... 377 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5, Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
4: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
5: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 

I'll dig deeper and report back.

lgatto commented 2 years ago

So this is actually rather weird. Elements 69 and 70 are empty (which leads to the out of bounds error - see below for details), but the error only happens when they are with a non-empty element:

> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
> ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning message:
In ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning message:
In ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id")
Fetching CDS for 2 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning message:
In ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
> ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 

So back to the error, that is thrown here:

    are_ok <- unlist(lapply(cds_genome, function(z) {
        if (is(z, "GRangesList"))
            all(z[[1]]$cds_ok) ## error when z has no ranges
        else NA
    }))

Below, I debug the code and z is element 69

Browse[1]> debug at /tmp/proteinToX.R!1gB7hU#20: are_ok <- unlist(lapply(cds_genome, function(z) {
    if (is(z, "GRangesList")) 
        all(z[[1]]$cds_ok)
    else NA
}))
Browse[2]> z <- cds_genome[[2]]
Browse[2]> z
GRangesList object of length 0:
<0 elements>
Browse[2]> is(z, "GRangesList")
[1] TRUE
Browse[2]> z[[1]]
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE,  : 
  subscript is out of bounds

This can be fixed with

    are_ok <- unlist(lapply(cds_genome, function(z) {
        if (is(z, "GRangesList") & length(z) > 0)
            all(z[[1]]$cds_ok)
        else NA
    }))

which will come as a PR in a moment (plus another minor issue related that). But it's not over...

> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
Error in `[[<-`(`*tmp*`, name, value = 1201L) : 
  1 elements in value to replace 0 elements
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 

Will continue to investigate and report.

Coming back to the oddity above, as to why the code worked with only empty GRangesList elements... that because the cds_genome list would be made of NULLs only:

cds_genome
$Q1MSJ5
NULL

$Q1MSJ5
NULL

which would, as anticipated, graciously produce

are_ok
Q1MSJ5 Q1MSJ5 
    NA     NA 
lgatto commented 2 years ago

Continuing my stroll along the proteinToGenome() path, the new errors happens here:

    res <- mapply(
        cds_genome, as(coords_cds, "IRangesList"), as(x, "IRangesList"),
        FUN = function(gnm, cds, prt) {
            if (is.null(gnm)) {
                GRanges()
            } else {
                ## Unlist because we'd like to have a GRanges here. Will split
                ## again later.
                maps <- unlist(.to_genome(gnm, cds))
                ## Don't want to have GRanges names!
                names(maps) <- NULL
                mcols(maps)$protein_start <- start(prt) ## ERROR
                mcols(maps)$protein_end <- end(prt)  ## ERROR
                maps[order(maps$exon_rank)]
            }
        })

Given that we have an empty GRangesList in cds_genome and gnm in the mapply() call, maps also becomes an empty GRanges and trying to set any mcols produces the (now obvious)

Error in `[[<-`(`*tmp*`, name, value = 1) : 
  1 elements in value to replace 0 elements

This can be fixes with an additional condition in the test (PR to follow):

...
            if (is.null(gnm) | !length(gnm)) {
                GRanges()
            } else {
...
> ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id")
Fetching CDS for 4 proteins ... 2 found
Checking CDS and protein sequence lengths ... 2/4 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
      seqnames            ranges strand |  uniprot_id           tx_id
         <Rle>         <IRanges>  <Rle> | <character>     <character>
  [1]        8 67195453-67195527      + |      Q1MSJ5 ENST00000262210
           protein_id         exon_id exon_rank    cds_ok protein_start
          <character>     <character> <integer> <logical>     <integer>
  [1] ENSP00000262210 ENSE00002101366        29      TRUE          1176
      protein_end
        <integer>
  [1]        1200
  -------
  seqinfo: 1 sequence from GRCh38 genome

$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

$Q1MSJ5
GRangesList object of length 2:
$ENSP00000262210
GRanges object with 1 range and 8 metadata columns:
      seqnames            ranges strand |  uniprot_id           tx_id
         <Rle>         <IRanges>  <Rle> | <character>     <character>
  [1]        8 67064399-67064476      + |      Q1MSJ5 ENST00000262210
           protein_id         exon_id exon_rank    cds_ok protein_start
          <character>     <character> <integer> <logical>     <integer>
  [1] ENSP00000262210 ENSE00003513069         1      TRUE             1
      protein_end
        <integer>
  [1]          26
  -------
  seqinfo: 1 sequence from GRCh38 genome

$ENSP00000430092
GRanges object with 2 ranges and 8 metadata columns:
      seqnames            ranges strand |  uniprot_id           tx_id
         <Rle>         <IRanges>  <Rle> | <character>     <character>
  [1]        8 67095665-67095732      + |      Q1MSJ5 ENST00000519668
  [2]        8 67103037-67103046      + |      Q1MSJ5 ENST00000519668
           protein_id         exon_id exon_rank    cds_ok protein_start
          <character>     <character> <integer> <logical>     <integer>
  [1] ENSP00000430092 ENSE00003677216         4      TRUE             1
  [2] ENSP00000430092 ENSE00001284106         5      TRUE             1
      protein_end
        <integer>
  [1]          26
  [2]          26
  -------
  seqinfo: 1 sequence from GRCh38 genome

Warning messages:
1: In ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
      seqnames            ranges strand |  uniprot_id           tx_id
         <Rle>         <IRanges>  <Rle> | <character>     <character>
  [1]        8 67195453-67195527      + |      Q1MSJ5 ENST00000262210
           protein_id         exon_id exon_rank    cds_ok protein_start
          <character>     <character> <integer> <logical>     <integer>
  [1] ENSP00000262210 ENSE00002101366        29      TRUE          1176
      protein_end
        <integer>
  [1]        1200
  -------
  seqinfo: 1 sequence from GRCh38 genome

$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: no sequences

Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
  No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein. 

What I'm worried about is that the originally empty GRanges[List] elements aren't consistent with the others, given that they miss the mcols, but I assume this is something that should be addressed downstream, not here.

jorainer commented 2 years ago

Thanks for the detailed report @lgatto ! The main reason seems to me that there's an issue between the CDS of the transcripts and the protein sequence. That is something I have already observed and that I can not really explain. This is also the main reason for this additional check I'm performing that compares the length of the CDS with the length of the protein sequence - if they don't match you'll get the Could not find a CDS whith the expected length for protein warnings - and in this cases it can well be that the peptide sequence is within the protein sequence but, because the CDS does not match the protein sequence, outside of the CDS (and the CDS is used to map coordinates to the genome).

jorainer commented 2 years ago

Note that there are two different types of Uniprot-to-Ensembl protein mappings (info provided by Ensembl):

> listUniprotMappingTypes(edb)
[1] "DIRECT"         "SEQUENCE_MATCH"

Maybe worth to also try filtering the EnsDb on one of those to see if it reduces the number of mismatches between CDS and protein sequences (or some of the problems)?

edb <- filter(edb, filter = ~ uniprot_mapping_type == "DIRECT")
lgatto commented 2 years ago

There's also the issue documented in #123 where some of the peptide coordinates along the UniProt proteins might actually not be mapped to the ENSEMBL canonical transcripts.

Thanks for the uniprot mapping types hint!