Coordinate retrieval errors on peptide belonging to multiple transcripts

manogenome commented 2 years ago

After running proteinToGenome, I'm observing a case where a peptide, arising from a gene with multiple isoforms, is found to have a mix of within exon and junction coordinates. Upon closer inspection, some of those coordinates turned out to be incorrect.

To illustrate the issue, I calculated the genome coordinates for the peptide, "IKWGDAGAEYVVESTGVFTTMEK", from the protein, P04006 (UniProt id). This returned seven genomic ranges -- three as within exon ranges on three transcripts and the remaining four as belonging to a junction on two transcripts.

Upon extracting the genomic sequence corresponding to the above genomic ranges for the peptide and translating it, we observe that only the three within exon ranges returned the actual peptide, "IKWGDAGAEYVVESTGVFTTMEK". The junction ranges don't match the supplied peptide.

Manually inspecting IRanges returned by proteinToTranscript showed that the junction ranges were shifted from the actual ranges and are incorrect. I suspect the bug to be hiding somewhere in this step. Below I've added a reproducible code to demonstrate the issue using this peptide as an example.

library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg38)
library(ensembldb)
library(AnnotationHub)

ah <- AnnotationHub()
qr <- query(ah, c("EnsDb", "sapiens", "97"))
edb <- qr[[1]]
Hsapiens <- BSgenome.Hsapiens.UCSC.hg38

## Example peptide: IKWGDAGAEYVVESTGVFTTMEK

## define IRanges of the peptide
prt <- IRanges(start = 85, end = 107)
names(prt) <- "P04406"

## peptide coordinate on the transcript
prt_tx <- proteinToTranscript(prt, edb, idType = "uniprot_id")
prt_tx

## peptide coordinate on the genome
prt_gn <- proteinToGenome(prt, edb, idType = "uniprot_id")
stacked <- stack(prt_gn[[1]], index.var = "stack")
stacked

## we can verify this by extracting the genomic regions and translating it
## back to peptides

seqlevelsStyle(stacked) <- "UCSC"
stacked

## extract genomic regions
sequence <- getSeq(Hsapiens,stacked)
sequence

## DNAStringSet object of length 7:
##    width seq
## [1]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
## [2]    65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
## [3]     4 CAAT
## [4]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
## [5]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
## [6]    65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
## [7]     4 CAAT

## translate
translate(sequence)

## AAStringSet object of length 7:
##     width seq
## [1]    23 IKWGDAGAEYVVESTGVFTTMEK
## [2]    21 DAPMFVMGVNHEKYDNSLKII
## [3]     1 Q
## [4]    23 IKWGDAGAEYVVESTGVFTTMEK
## [5]    23 IKWGDAGAEYVVESTGVFTTMEK
## [6]    21 DAPMFVMGVNHEKYDNSLKII
## [7]     1 Q

## cooresponding transcripts
stacked$tx_id

## [1] "ENST00000229239"
## [2] "ENST00000396858" 
## [3] "ENST00000396858"
## [4] "ENST00000396859"
## [5] "ENST00000396861"
## [6] "ENST00000619601"
## [7] "ENST00000619601"

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ensembldb_2.18.2                         AnnotationFilter_1.18.0                  BSgenome.Hsapiens.UCSC.hg38_1.4.4       
 [4] BSgenome_1.62.0                          rtracklayer_1.54.0                       TxDb.Hsapiens.UCSC.hg38.knownGene_3.14.0
 [7] GenomicFeatures_1.46.1                   Biostrings_2.62.0                        XVector_0.34.0                          
[10] SummarizedExperiment_1.24.0              MatrixGenerics_1.6.0                     matrixStats_0.61.0                      
[13] Spectra_1.4.1                            ProtGenerics_1.26.0                      BiocParallel_1.28.1                     
[16] AnnotationDbi_1.56.2                     Biobase_2.54.0                           GenomicRanges_1.46.1                    
[19] GenomeInfoDb_1.30.0                      IRanges_2.28.0                           S4Vectors_0.32.3                        
[22] AnnotationHub_3.2.0                      BiocFileCache_2.2.0                      dbplyr_2.1.1                            
[25] BiocGenerics_0.40.0                     

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  fs_1.5.0                      bit64_4.0.5                   filelock_1.0.2                progress_1.2.2               
 [6] httr_1.4.2                    tools_4.1.1                   utf8_1.2.2                    R6_2.5.1                      lazyeval_0.2.2               
[11] DBI_1.1.1                     withr_2.4.2                   tidyselect_1.1.1              prettyunits_1.1.1             bit_4.0.4                    
[16] curl_4.3.2                    compiler_4.1.1                cli_3.1.0                     xml2_1.3.2                    DelayedArray_0.20.0          
[21] rappdirs_0.3.3                stringr_1.4.0                 digest_0.6.28                 Rsamtools_2.10.0              rmarkdown_2.11               
[26] pkgconfig_2.0.3               htmltools_0.5.2               fastmap_1.1.0                 rlang_0.4.12                  rstudioapi_0.13              
[31] RSQLite_2.2.8                 shiny_1.7.1                   BiocIO_1.4.0                  generics_0.1.1                dplyr_1.0.7                  
[36] RCurl_1.98-1.5                magrittr_2.0.1                GenomeInfoDbData_1.2.7        Matrix_1.3-4                  Rcpp_1.0.7                   
[41] fansi_0.5.0                   MsCoreUtils_1.6.0             lifecycle_1.0.1               stringi_1.7.5                 yaml_2.2.1                   
[46] MASS_7.3-54                   zlibbioc_1.40.0               grid_4.1.1                    blob_1.2.2                    parallel_4.1.1               
[51] promises_1.2.0.1              crayon_1.4.2                  lattice_0.20-44               hms_1.1.1                     KEGGREST_1.34.0              
[56] knitr_1.36                    pillar_1.6.4                  rjson_0.2.20                  biomaRt_2.50.1                XML_3.99-0.8                 
[61] glue_1.5.0                    BiocVersion_3.14.0            evaluate_0.14                 BiocManager_1.30.16           png_0.1-7                    
[66] vctrs_0.3.8                   httpuv_1.6.3                  purrr_0.3.4                   clue_0.3-60                   assertthat_0.2.1             
[71] cachem_1.0.6                  xfun_0.28                     mime_0.12                     xtable_1.8-4                  restfulr_0.0.13              
[76] later_1.3.0                   tibble_3.1.6                  GenomicAlignments_1.30.0      memoise_2.0.0                 cluster_2.1.2                
[81] ellipsis_0.3.2                interactiveDisplayBase_1.32.0

jorainer commented 2 years ago

Thanks for the detailed report @manogenome ! I'll have a closer look into it.

jorainer commented 2 years ago

The coordinates are all correctly calculated, but the problem is the mapping of Uniprot IDs to Ensembl Protein IDs:

eprt <- proteins(edb, filter = ~ uniprot == "P04406")
## re-order them matching the order in `stacked`
eprt <- eprt[match(eprt$protein_id, levels(stacked$stack)), ]

eprt
DataFrame with 5 rows and 4 columns
            tx_id      protein_id       protein_sequence  uniprot_id
      <character>     <character>            <character> <character>
1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR..      P04406
2 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA..      P04406
3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR..      P04406
4 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR..      P04406
5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA..      P04406

so, two of the Ensembl protein IDs that are assigned to P04406 have in fact a different amino acid sequence. Not surprisingly, the peptide sequence for the subset is thus also different:

substring(eprt$protein_sequence, 85, 107)
[1] "IKWGDAGAEYVVESTGVFTTMEK" "DAPMFVMGVNHEKYDNSLKIISN"
[3] "IKWGDAGAEYVVESTGVFTTMEK" "IKWGDAGAEYVVESTGVFTTMEK"
[5] "DAPMFVMGVNHEKYDNSLKIISN"

But this fits what you get translating the genomic sequence. So, I don't have a solution for your problem - the annotations and mappings between identifiers are provided by Ensembl (or Uniprot).

manogenome commented 2 years ago

Thank you, @jorainer , for the detailed answer!! This has clarified a few things for me in narrowing down the source of my problems, which I'm currently validating; will post an update on this later this week.

jorainer commented 2 years ago

Excellent! Thank for the feedback!

Cross-database mapping of identifiers has always been a problem in the genomics(/proteomics/metabolomics) field.

manogenome commented 2 years ago

Turns out, there’s a discordance in ensembldb’s proteinToGenome() output between uniprot_id and protein_id as input value for idType parameter. In our example case, the presence of the peptide, IKWGDAGAEYVVESTGVFTTMEK, from the Uniprot ID, P04406, on all the five matching Ensembl sequences ruled out any issue with Uniprot-Ensembl ID mapping. So we can expect to correctly retrieve its corresponding genomic coordinates which should code for this peptide. But that was not the case as we saw above, where we tried to use the uniprot_id and the peptide coordinates to retrieve genome coordinates for all its matching Ensembl proteins.

Further inspection of this peptide’s coordinate on the five matching Ensembl proteins showed that only on the three sequences did the coordinates actually match the position on the UniProt sequence, P04406. When these five peptide coordinates were used by proteinToGenome() function with Ensembl protein_id’s as idType input, all the resulting genome coordinates correctly coded for the peptide - IKWGDAGAEYVVESTGVFTTMEK.

Visualizing these coordinates along with annotation tracks revealed that two transcript isoforms that encoded this peptide had in fact had a different translational start points as opposed to the other three, but all fell within the same exon (ENSE00003678358) on the genome. We can conclude proteinToGenome() function need to take this into account to correctly retrieve genomic coordinates for such isoforms when using peptide coordinates on the canonical uniport protein as its input. We can also observe that the error gets propagated from proteinToTranscript() to proteinToGenome() function by looking at the result, requiring further investigation.

To address this we will need the canonical status of the transcripts to check for exon ids in the isoforms to perform coordinate shift if they've different translational start positions. From version 104, Ensembl has included this canonical tag for the transcripts. I think it would be useful to include this tag as metadata for downstream analysis.

loading libraries

library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomicRanges)
library(ensembldb)
library(AnnotationHub)

Hsapiens <- BSgenome.Hsapiens.UCSC.hg38
ah <- AnnotationHub()
#> snapshotDate(): 2021-10-20
qr <- query(ah, c("EnsDb", "sapiens", "97"))
edb <- qr[[1]]
#> loading from cache
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.4
#> |Creation time: Sat Jul  6 01:28:30 2019
#> |ensembl_version: 97
#> |ensembl_host: localhost
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.1
#> | No. of genes: 67667.
#> | No. of transcripts: 248916.
#> |Protein data available.

1) protein_id based genome coordinate retrieval

## retrieve matching Ensembl proteins for P04406
prts <- proteins(edb, filter = UniprotFilter("P04406"))
prts
#> DataFrame with 5 rows and 4 columns
#>             tx_id      protein_id       protein_sequence  uniprot_id
#>       <character>     <character>            <character> <character>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR..      P04406
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR..      P04406
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR..      P04406
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA..      P04406
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA..      P04406

## peptide's start coordinates of ENSPs
prts$start <-
  nchar(do.call(
    "rbind",
    base::strsplit(prts$protein_sequence, 'IKWGDAGAEYVVESTGVFTTMEK')
  )[, 1]) + 1

## peptide end coordinates of ENSPs
prts$end <- prts$start + nchar('IKWGDAGAEYVVESTGVFTTMEK') - 1
prts
#> DataFrame with 5 rows and 6 columns
#>             tx_id      protein_id       protein_sequence  uniprot_id     start
#>       <character>     <character>            <character> <character> <numeric>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR..      P04406        85
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR..      P04406        85
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR..      P04406        85
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA..      P04406        43
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA..      P04406        43
#>         end
#>   <numeric>
#> 1       107
#> 2       107
#> 3       107
#> 4        65
#> 5        65

## peptide ranges on ENSPs
ir <- IRanges(start = prts$start, end = prts$end, names = prts$protein_id)
ir
#> IRanges object with 5 ranges and 0 metadata columns:
#>                       start       end     width
#>                   <integer> <integer> <integer>
#>   ENSP00000229239        85       107        23
#>   ENSP00000380070        85       107        23
#>   ENSP00000380068        85       107        23
#>   ENSP00000380067        43        65        23
#>   ENSP00000478864        43        65        23

## peptide coordinates on ENSTs
trl <- proteinToTranscript(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 5 metadata columns:
#>                       start       end     width |      protein_id
#>                   <integer> <integer> <integer> |     <character>
#>   ENST00000229239       329       397        69 | ENSP00000229239
#>   ENST00000396861       404       472        69 | ENSP00000380070
#>   ENST00000396859       332       400        69 | ENSP00000380068
#>   ENST00000396858       392       460        69 | ENSP00000380067
#>   ENST00000619601       127       195        69 | ENSP00000478864
#>                             tx_id    cds_ok protein_start protein_end
#>                       <character> <logical>     <integer>   <integer>
#>   ENST00000229239 ENST00000229239      TRUE            85         107
#>   ENST00000396861 ENST00000396861      TRUE            85         107
#>   ENST00000396859 ENST00000396859      TRUE            85         107
#>   ENST00000396858 ENST00000396858      TRUE            43          65
#>   ENST00000619601 ENST00000619601      TRUE            43          65

## peptide coordinates on ENSG
grl <- proteinToGenome(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
gr <- unlist(GRangesList(grl))
gr
#> GRanges object with 5 ranges and 7 metadata columns:
#>                   seqnames          ranges strand |      protein_id
#>                      <Rle>       <IRanges>  <Rle> |     <character>
#>   ENSP00000229239       12 6536936-6537004      + | ENSP00000229239
#>   ENSP00000380070       12 6536936-6537004      + | ENSP00000380070
#>   ENSP00000380068       12 6536936-6537004      + | ENSP00000380068
#>   ENSP00000380067       12 6536936-6537004      + | ENSP00000380067
#>   ENSP00000478864       12 6536936-6537004      + | ENSP00000478864
#>                             tx_id         exon_id exon_rank    cds_ok
#>                       <character>     <character> <integer> <logical>
#>   ENSP00000229239 ENST00000229239 ENSE00003678358         5      TRUE
#>   ENSP00000380070 ENST00000396861 ENSE00003678358         5      TRUE
#>   ENSP00000380068 ENST00000396859 ENSE00003678358         4      TRUE
#>   ENSP00000380067 ENST00000396858 ENSE00003678358         4      TRUE
#>   ENSP00000478864 ENST00000619601 ENSE00003678358         3      TRUE
#>                   protein_start protein_end
#>                       <integer>   <integer>
#>   ENSP00000229239            85         107
#>   ENSP00000380070            85         107
#>   ENSP00000380068            85         107
#>   ENSP00000380067            43          65
#>   ENSP00000478864            43          65
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome

## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 5:
#>     width seq                                               names               
#> [1]    69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000229239
#> [2]    69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380070
#> [3]    69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380068
#> [4]    69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380067
#> [5]    69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000478864

## shows the retrieved coordinates code for the same peptide
translate(sequence)
#> AAStringSet object of length 5:
#>     width seq                                               names               
#> [1]    23 IKWGDAGAEYVVESTGVFTTMEK                           ENSP00000229239
#> [2]    23 IKWGDAGAEYVVESTGVFTTMEK                           ENSP00000380070
#> [3]    23 IKWGDAGAEYVVESTGVFTTMEK                           ENSP00000380068
#> [4]    23 IKWGDAGAEYVVESTGVFTTMEK                           ENSP00000380067
#> [5]    23 IKWGDAGAEYVVESTGVFTTMEK                           ENSP00000478864

ENSP-based_GRanges

2) uniprot_id based genome coordinate retrieval

edb <- qr[[1]]
#> loading from cache

## let's define IRanges for the peptide
ir <- IRanges(start = 85, end = 107)
names(ir) <- "P04406"

## peptide coordinates on ENSTs (we can see two transcripts with incorrect coords)
trl <- proteinToTranscript(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ...
#> 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 6 metadata columns:
#>                       start       end     width |      protein_id
#>                   <integer> <integer> <integer> |     <character>
#>   ENST00000229239       329       397        69 | ENSP00000229239
#>   ENST00000396858       518       586        69 | ENSP00000380067
#>   ENST00000396859       332       400        69 | ENSP00000380068
#>   ENST00000396861       404       472        69 | ENSP00000380070
#>   ENST00000619601       253       321        69 | ENSP00000478864
#>                             tx_id    cds_ok protein_start protein_end
#>                       <character> <logical>     <integer>   <integer>
#>   ENST00000229239 ENST00000229239      TRUE            85         107
#>   ENST00000396858 ENST00000396858      TRUE            85         107
#>   ENST00000396859 ENST00000396859      TRUE            85         107
#>   ENST00000396861 ENST00000396861      TRUE            85         107
#>   ENST00000619601 ENST00000619601      TRUE            85         107
#>                    uniprot_id
#>                   <character>
#>   ENST00000229239      P04406
#>   ENST00000396858      P04406
#>   ENST00000396859      P04406
#>   ENST00000396861      P04406
#>   ENST00000619601      P04406

## peptide coordinates on ENSG
prt_gn <- proteinToGenome(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ... 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
gr <- stack(prt_gn[[1]], index.var = "stack")
gr
#> GRanges object with 7 ranges and 9 metadata columns:
#>       seqnames          ranges strand |           stack  uniprot_id
#>          <Rle>       <IRanges>  <Rle> |           <Rle> <character>
#>   [1]       12 6536936-6537004      + | ENSP00000229239      P04406
#>   [2]       12 6537152-6537216      + | ENSP00000380067      P04406
#>   [3]       12 6537309-6537312      + | ENSP00000380067      P04406
#>   [4]       12 6536936-6537004      + | ENSP00000380068      P04406
#>   [5]       12 6536936-6537004      + | ENSP00000380070      P04406
#>   [6]       12 6537152-6537216      + | ENSP00000478864      P04406
#>   [7]       12 6537309-6537312      + | ENSP00000478864      P04406
#>                 tx_id      protein_id         exon_id exon_rank    cds_ok
#>           <character>     <character>     <character> <integer> <logical>
#>   [1] ENST00000229239 ENSP00000229239 ENSE00003678358         5      TRUE
#>   [2] ENST00000396858 ENSP00000380067 ENSE00003562150         5      TRUE
#>   [3] ENST00000396858 ENSP00000380067 ENSE00003663529         6      TRUE
#>   [4] ENST00000396859 ENSP00000380068 ENSE00003678358         4      TRUE
#>   [5] ENST00000396861 ENSP00000380070 ENSE00003678358         5      TRUE
#>   [6] ENST00000619601 ENSP00000478864 ENSE00003562150         4      TRUE
#>   [7] ENST00000619601 ENSP00000478864 ENSE00003663529         5      TRUE
#>       protein_start protein_end
#>           <integer>   <integer>
#>   [1]            85         107
#>   [2]            85         107
#>   [3]            85         107
#>   [4]            85         107
#>   [5]            85         107
#>   [6]            85         107
#>   [7]            85         107
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome

## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 7:
#>     width seq
#> [1]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [2]    65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [3]     4 CAAT
#> [4]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [5]    69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [6]    65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [7]     4 CAAT

## confirms we have indeed retrieved incorrect coordinates for two transcripts 
## with different translational start positions.
translate(sequence)
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[2]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[3]]': last base was ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[6]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[7]]': last base was ignored
#> AAStringSet object of length 7:
#>     width seq
#> [1]    23 IKWGDAGAEYVVESTGVFTTMEK
#> [2]    21 DAPMFVMGVNHEKYDNSLKII
#> [3]     1 Q
#> [4]    23 IKWGDAGAEYVVESTGVFTTMEK
#> [5]    23 IKWGDAGAEYVVESTGVFTTMEK
#> [6]    21 DAPMFVMGVNHEKYDNSLKII
#> [7]     1 Q

UniprotID-based_GRanges

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.1 (2021-08-10) #> os Ubuntu 18.04.5 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Brussels #> date 2021-12-16 #> pandoc 2.11.4 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> AnnotationDbi * 1.56.2 2021-11-09 [1] Bioconductor #> AnnotationFilter * 1.18.0 2021-10-26 [1] Bioconductor #> AnnotationHub * 3.2.0 2021-10-26 [1] Bioconductor #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1) #> Biobase * 2.54.0 2021-10-26 [1] Bioconductor #> BiocFileCache * 2.2.0 2021-10-26 [1] Bioconductor #> BiocGenerics * 0.40.0 2021-10-26 [1] Bioconductor #> BiocIO 1.4.0 2021-10-26 [1] Bioconductor #> BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.1.1) #> BiocParallel 1.28.2 2021-11-25 [1] Bioconductor #> BiocVersion 3.14.0 2021-05-19 [1] Bioconductor #> biomaRt 2.50.1 2021-11-21 [1] Bioconductor #> Biostrings * 2.62.0 2021-10-26 [1] Bioconductor #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.1) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.1) #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.1) #> blob 1.2.2 2021-07-23 [1] CRAN (R 4.1.1) #> BSgenome * 1.62.0 2021-10-26 [1] Bioconductor #> BSgenome.Hsapiens.UCSC.hg38 * 1.4.4 2021-11-26 [1] Bioconductor #> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1) #> cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.1) #> crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1) #> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.1) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.1) #> dbplyr * 2.1.1 2021-04-06 [1] CRAN (R 4.1.1) #> DelayedArray 0.20.0 2021-10-26 [1] Bioconductor #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1) #> dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.1) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.1) #> ensembldb * 2.18.2 2021-11-08 [1] Bioconductor #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1) #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.1) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1) #> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.1) #> fs 1.5.1 2021-11-30 [1] CRAN (R 4.1.1) #> generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.1) #> GenomeInfoDb * 1.30.0 2021-10-26 [1] Bioconductor #> GenomeInfoDbData 1.2.7 2021-09-28 [1] Bioconductor #> GenomicAlignments 1.30.0 2021-10-26 [1] Bioconductor #> GenomicFeatures * 1.46.1 2021-10-27 [1] Bioconductor #> GenomicRanges * 1.46.1 2021-11-18 [1] Bioconductor #> glue 1.5.1 2021-11-30 [1] CRAN (R 4.1.1) #> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1) #> httpuv 1.6.3 2021-09-09 [1] CRAN (R 4.1.1) #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.1) #> interactiveDisplayBase 1.32.0 2021-10-26 [1] Bioconductor #> IRanges * 2.28.0 2021-10-26 [1] Bioconductor #> KEGGREST 1.34.0 2021-10-26 [1] Bioconductor #> knitr 1.36 2021-09-29 [1] CRAN (R 4.1.1) #> later 1.3.0 2021-08-18 [1] CRAN (R 4.1.1) #> lattice 0.20-44 2021-05-02 [4] CRAN (R 4.1.0) #> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.1) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.1) #> Matrix 1.3-4 2021-06-01 [4] CRAN (R 4.1.0) #> MatrixGenerics 1.6.0 2021-10-26 [1] Bioconductor #> matrixStats 0.61.0 2021-09-17 [1] CRAN (R 4.1.1) #> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1) #> mime 0.12 2021-09-28 [1] CRAN (R 4.1.1) #> pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1) #> png 0.1-7 2013-12-03 [1] CRAN (R 4.1.1) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.1) #> progress 1.2.2 2019-05-16 [1] CRAN (R 4.1.1) #> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.1.1) #> ProtGenerics 1.26.0 2021-10-26 [1] Bioconductor #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1) #> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.1) #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.1) #> RCurl 1.98-1.5 2021-09-17 [1] CRAN (R 4.1.1) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1) #> restfulr 0.0.13 2017-08-06 [1] CRAN (R 4.1.1) #> rjson 0.2.20 2018-06-08 [1] CRAN (R 4.1.1) #> rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.1) #> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1) #> Rsamtools 2.10.0 2021-10-26 [1] Bioconductor #> RSQLite 2.2.9 2021-12-06 [1] CRAN (R 4.1.1) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1) #> rtracklayer * 1.54.0 2021-10-26 [1] Bioconductor #> S4Vectors * 0.32.3 2021-11-21 [1] Bioconductor #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1) #> shiny 1.7.1 2021-10-02 [1] CRAN (R 4.1.1) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1) #> SummarizedExperiment 1.24.0 2021-10-26 [1] Bioconductor #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1) #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.1) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.1) #> withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1) #> xfun 0.28 2021-11-04 [1] CRAN (R 4.1.1) #> XML 3.99-0.8 2021-09-17 [1] CRAN (R 4.1.1) #> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1) #> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.1) #> XVector * 0.34.0 2021-10-26 [1] Bioconductor #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.1) #> zlibbioc 1.40.0 2021-10-26 [1] Bioconductor #> #> [1] /home/manoj/R/x86_64-pc-linux-gnu-library/4.1 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

jorainer commented 2 years ago

Thanks for this very comprehensive and detailed summary @manogenome !

So, if I understand correctly, restricting genomic mapping using canonical transcript of a gene would solve this?

manogenome commented 2 years ago

Yes, this would help us correctly retrieve the genome coordinate for the peptide as we use uniprot_id and peptide coordinates as inputs. We would still need genome coordinates for the peptide on all the isoforms. Now that we know the exon id(s) for the peptide on the canonical transcript, we can simply assign (instead of recalculating) the genome coordinates to all the transcript isoforms that share this peptide's exon id(s) and drop the other isoforms. This would require some validation. But first, we will need the canonical tag (Ensembl version >= 104).

jorainer commented 2 years ago

Once you have genomic coordinates you could also use the genomeToTranscript to get it's position within all transcripts (although that will not consider or test the coding region!).

I have now a version of a EnsDb database for Homo sapiens with an additional column tx_is_canonical for the transcript table. Unfortunately, the example from above does not work for Ensembl release 105 because now there is no "P04406" Uniprot ID available, but two different ones: "P04406-1" (returns correct coordinates/peptide sequence) and "P04406-2" (returns the 2 wrong/different peptide sequences).

With the update it would be e.g. possible to filter the EnsDb database to only canonical transcripts and then perform the mapping only with that. Would that be a working solution for you? Note that this would then only work with Ensembl release 104 and 105 but not earlier versions. Before re-building all EnsDb databases for these two releases I would like you to test the new functionality to see if it would be useful at all. @manogenome , please let me know and I can provide you with the updated EnsDb (for human) and the updated ensembldb package.

manogenome commented 2 years ago

Thanks, @jorainer for adding this tag. You're right this tag will only work for newer Ensembl releases >= 104. I just cross-checked the hyperlinks for P04406-1 and P04406-2 in the below table. We can see P04406-1 is marked as canonical and matches three isoforms. And P04406-2 matches two isoforms.

P04406-1

I also went through the Uniprot reviewed fasta file for humans. It only had a P04406 tag for the canonical sequence. It's possible Uniprot will update this tag to P04406-1 later on. I'm happy to test the usefulness of this tag. You can share the updated EnsDb (for humans) and the ensembldb package with me for testing this functionality when it's ready.

jorainer commented 2 years ago

You can install the updated version of ensembldb with BiocManager::install("jorainer/ensembldb").

The EnsDb with the updated annotation is availeble here (the one for Ensembl 105).

You could now restrict/subset the whole EnsDb database to only canonical transcript with:

edb <- filter(edb, filter = ~ tx_is_canonical == 1)

manogenome commented 2 years ago

Thank you, @jorainer !! The updated package, database and the new tx_is_canonical filter are working fine. Now, I'll look at the isoform coordinates issue and get back to you.

lgatto commented 2 years ago

Thanks @jorainer - will that flag be available on the database itself in the package?

jorainer commented 2 years ago

The annotation is available as an additional column to the transcript database table ("tx_is_canonical"). All EnsDb databases for Ensembl release 105 currently available in AnnotationHub have this additional column.

ensembldb in addition provides a new filter (TxIsCanonicalFilter) that allows to filter/subset an EnsDb database by that information.

jorainer / ensembldb