Open manogenome opened 2 years ago
Thanks for the detailed report @manogenome ! I'll have a closer look into it.
The coordinates are all correctly calculated, but the problem is the mapping of Uniprot IDs to Ensembl Protein IDs:
eprt <- proteins(edb, filter = ~ uniprot == "P04406")
## re-order them matching the order in `stacked`
eprt <- eprt[match(eprt$protein_id, levels(stacked$stack)), ]
eprt
DataFrame with 5 rows and 4 columns
tx_id protein_id protein_sequence uniprot_id
<character> <character> <character> <character>
1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406
2 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406
3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406
4 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406
5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406
so, two of the Ensembl protein IDs that are assigned to P04406 have in fact a different amino acid sequence. Not surprisingly, the peptide sequence for the subset is thus also different:
substring(eprt$protein_sequence, 85, 107)
[1] "IKWGDAGAEYVVESTGVFTTMEK" "DAPMFVMGVNHEKYDNSLKIISN"
[3] "IKWGDAGAEYVVESTGVFTTMEK" "IKWGDAGAEYVVESTGVFTTMEK"
[5] "DAPMFVMGVNHEKYDNSLKIISN"
But this fits what you get translating the genomic sequence. So, I don't have a solution for your problem - the annotations and mappings between identifiers are provided by Ensembl (or Uniprot).
Thank you, @jorainer , for the detailed answer!! This has clarified a few things for me in narrowing down the source of my problems, which I'm currently validating; will post an update on this later this week.
Excellent! Thank for the feedback!
Cross-database mapping of identifiers has always been a problem in the genomics(/proteomics/metabolomics) field.
Turns out, there’s a discordance in ensembldb’s proteinToGenome() output between uniprot_id
and protein_id
as input value for idType
parameter. In our example case, the presence of the peptide, IKWGDAGAEYVVESTGVFTTMEK
, from the Uniprot ID, P04406, on all the five matching Ensembl sequences ruled out any issue with Uniprot-Ensembl ID mapping. So we can expect to correctly retrieve its corresponding genomic coordinates which should code for this peptide. But that was not the case as we saw above, where we tried to use the uniprot_id
and the peptide coordinates to retrieve genome coordinates for all its matching Ensembl proteins.
Further inspection of this peptide’s coordinate on the five matching Ensembl proteins showed that only on the three sequences did the coordinates actually match the position on the UniProt sequence, P04406. When these five peptide coordinates were used by proteinToGenome() function with Ensembl protein_id
’s as idType
input, all the resulting genome coordinates correctly coded for the peptide - IKWGDAGAEYVVESTGVFTTMEK
.
Visualizing these coordinates along with annotation tracks revealed that two transcript isoforms that encoded this peptide had in fact had a different translational start points as opposed to the other three, but all fell within the same exon (ENSE00003678358) on the genome. We can conclude proteinToGenome() function need to take this into account to correctly retrieve genomic coordinates for such isoforms when using peptide coordinates on the canonical uniport protein as its input. We can also observe that the error gets propagated from proteinToTranscript() to proteinToGenome() function by looking at the result, requiring further investigation.
To address this we will need the canonical status of the transcripts to check for exon ids in the isoforms to perform coordinate shift if they've different translational start positions. From version 104, Ensembl has included this canonical tag for the transcripts. I think it would be useful to include this tag as metadata for downstream analysis.
library(BSgenome.Hsapiens.UCSC.hg38)
library(GenomicRanges)
library(ensembldb)
library(AnnotationHub)
Hsapiens <- BSgenome.Hsapiens.UCSC.hg38
ah <- AnnotationHub()
#> snapshotDate(): 2021-10-20
qr <- query(ah, c("EnsDb", "sapiens", "97"))
edb <- qr[[1]]
#> loading from cache
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.4
#> |Creation time: Sat Jul 6 01:28:30 2019
#> |ensembl_version: 97
#> |ensembl_host: localhost
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.1
#> | No. of genes: 67667.
#> | No. of transcripts: 248916.
#> |Protein data available.
## retrieve matching Ensembl proteins for P04406
prts <- proteins(edb, filter = UniprotFilter("P04406"))
prts
#> DataFrame with 5 rows and 4 columns
#> tx_id protein_id protein_sequence uniprot_id
#> <character> <character> <character> <character>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406
## peptide's start coordinates of ENSPs
prts$start <-
nchar(do.call(
"rbind",
base::strsplit(prts$protein_sequence, 'IKWGDAGAEYVVESTGVFTTMEK')
)[, 1]) + 1
## peptide end coordinates of ENSPs
prts$end <- prts$start + nchar('IKWGDAGAEYVVESTGVFTTMEK') - 1
prts
#> DataFrame with 5 rows and 6 columns
#> tx_id protein_id protein_sequence uniprot_id start
#> <character> <character> <character> <character> <numeric>
#> 1 ENST00000229239 ENSP00000229239 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 2 ENST00000396861 ENSP00000380070 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 3 ENST00000396859 ENSP00000380068 MGKVKVGVNGFGRIGRLVTR.. P04406 85
#> 4 ENST00000396858 ENSP00000380067 MVYMFQYDSTHGKFHGTVKA.. P04406 43
#> 5 ENST00000619601 ENSP00000478864 MVYMFQYDSTHGKFHGTVKA.. P04406 43
#> end
#> <numeric>
#> 1 107
#> 2 107
#> 3 107
#> 4 65
#> 5 65
## peptide ranges on ENSPs
ir <- IRanges(start = prts$start, end = prts$end, names = prts$protein_id)
ir
#> IRanges object with 5 ranges and 0 metadata columns:
#> start end width
#> <integer> <integer> <integer>
#> ENSP00000229239 85 107 23
#> ENSP00000380070 85 107 23
#> ENSP00000380068 85 107 23
#> ENSP00000380067 43 65 23
#> ENSP00000478864 43 65 23
## peptide coordinates on ENSTs
trl <- proteinToTranscript(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 5 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000229239 329 397 69 | ENSP00000229239
#> ENST00000396861 404 472 69 | ENSP00000380070
#> ENST00000396859 332 400 69 | ENSP00000380068
#> ENST00000396858 392 460 69 | ENSP00000380067
#> ENST00000619601 127 195 69 | ENSP00000478864
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000229239 ENST00000229239 TRUE 85 107
#> ENST00000396861 ENST00000396861 TRUE 85 107
#> ENST00000396859 ENST00000396859 TRUE 85 107
#> ENST00000396858 ENST00000396858 TRUE 43 65
#> ENST00000619601 ENST00000619601 TRUE 43 65
## peptide coordinates on ENSG
grl <- proteinToGenome(ir, edb, idType = "protein_id")
#> Fetching CDS for 5 proteins ... 5 found
#> Checking CDS and protein sequence lengths ... 5/5 OK
gr <- unlist(GRangesList(grl))
gr
#> GRanges object with 5 ranges and 7 metadata columns:
#> seqnames ranges strand | protein_id
#> <Rle> <IRanges> <Rle> | <character>
#> ENSP00000229239 12 6536936-6537004 + | ENSP00000229239
#> ENSP00000380070 12 6536936-6537004 + | ENSP00000380070
#> ENSP00000380068 12 6536936-6537004 + | ENSP00000380068
#> ENSP00000380067 12 6536936-6537004 + | ENSP00000380067
#> ENSP00000478864 12 6536936-6537004 + | ENSP00000478864
#> tx_id exon_id exon_rank cds_ok
#> <character> <character> <integer> <logical>
#> ENSP00000229239 ENST00000229239 ENSE00003678358 5 TRUE
#> ENSP00000380070 ENST00000396861 ENSE00003678358 5 TRUE
#> ENSP00000380068 ENST00000396859 ENSE00003678358 4 TRUE
#> ENSP00000380067 ENST00000396858 ENSE00003678358 4 TRUE
#> ENSP00000478864 ENST00000619601 ENSE00003678358 3 TRUE
#> protein_start protein_end
#> <integer> <integer>
#> ENSP00000229239 85 107
#> ENSP00000380070 85 107
#> ENSP00000380068 85 107
#> ENSP00000380067 43 65
#> ENSP00000478864 43 65
#> -------
#> seqinfo: 1 sequence from GRCh38 genome
## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 5:
#> width seq names
#> [1] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000229239
#> [2] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380070
#> [3] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380068
#> [4] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000380067
#> [5] 69 ATCAAGTGGGGCGATGCTGGCGC...GCGTCTTCACCACCATGGAGAAG ENSP00000478864
## shows the retrieved coordinates code for the same peptide
translate(sequence)
#> AAStringSet object of length 5:
#> width seq names
#> [1] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000229239
#> [2] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380070
#> [3] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380068
#> [4] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000380067
#> [5] 23 IKWGDAGAEYVVESTGVFTTMEK ENSP00000478864
edb <- qr[[1]]
#> loading from cache
## let's define IRanges for the peptide
ir <- IRanges(start = 85, end = 107)
names(ir) <- "P04406"
## peptide coordinates on ENSTs (we can see two transcripts with incorrect coords)
trl <- proteinToTranscript(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ...
#> 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
tr <- unlist(trl, use.names = FALSE)
tr
#> IRanges object with 5 ranges and 6 metadata columns:
#> start end width | protein_id
#> <integer> <integer> <integer> | <character>
#> ENST00000229239 329 397 69 | ENSP00000229239
#> ENST00000396858 518 586 69 | ENSP00000380067
#> ENST00000396859 332 400 69 | ENSP00000380068
#> ENST00000396861 404 472 69 | ENSP00000380070
#> ENST00000619601 253 321 69 | ENSP00000478864
#> tx_id cds_ok protein_start protein_end
#> <character> <logical> <integer> <integer>
#> ENST00000229239 ENST00000229239 TRUE 85 107
#> ENST00000396858 ENST00000396858 TRUE 85 107
#> ENST00000396859 ENST00000396859 TRUE 85 107
#> ENST00000396861 ENST00000396861 TRUE 85 107
#> ENST00000619601 ENST00000619601 TRUE 85 107
#> uniprot_id
#> <character>
#> ENST00000229239 P04406
#> ENST00000396858 P04406
#> ENST00000396859 P04406
#> ENST00000396861 P04406
#> ENST00000619601 P04406
## peptide coordinates on ENSG
prt_gn <- proteinToGenome(ir, edb, idType = "uniprot_id")
#> Fetching CDS for 1 proteins ... 1 found
#> Checking CDS and protein sequence lengths ... 1/1 OK
gr <- stack(prt_gn[[1]], index.var = "stack")
gr
#> GRanges object with 7 ranges and 9 metadata columns:
#> seqnames ranges strand | stack uniprot_id
#> <Rle> <IRanges> <Rle> | <Rle> <character>
#> [1] 12 6536936-6537004 + | ENSP00000229239 P04406
#> [2] 12 6537152-6537216 + | ENSP00000380067 P04406
#> [3] 12 6537309-6537312 + | ENSP00000380067 P04406
#> [4] 12 6536936-6537004 + | ENSP00000380068 P04406
#> [5] 12 6536936-6537004 + | ENSP00000380070 P04406
#> [6] 12 6537152-6537216 + | ENSP00000478864 P04406
#> [7] 12 6537309-6537312 + | ENSP00000478864 P04406
#> tx_id protein_id exon_id exon_rank cds_ok
#> <character> <character> <character> <integer> <logical>
#> [1] ENST00000229239 ENSP00000229239 ENSE00003678358 5 TRUE
#> [2] ENST00000396858 ENSP00000380067 ENSE00003562150 5 TRUE
#> [3] ENST00000396858 ENSP00000380067 ENSE00003663529 6 TRUE
#> [4] ENST00000396859 ENSP00000380068 ENSE00003678358 4 TRUE
#> [5] ENST00000396861 ENSP00000380070 ENSE00003678358 5 TRUE
#> [6] ENST00000619601 ENSP00000478864 ENSE00003562150 4 TRUE
#> [7] ENST00000619601 ENSP00000478864 ENSE00003663529 5 TRUE
#> protein_start protein_end
#> <integer> <integer>
#> [1] 85 107
#> [2] 85 107
#> [3] 85 107
#> [4] 85 107
#> [5] 85 107
#> [6] 85 107
#> [7] 85 107
#> -------
#> seqinfo: 1 sequence from GRCh38 genome
## extract genomic sequences
seqlevelsStyle(gr) <- "UCSC"
sequence <- getSeq(Hsapiens,gr)
sequence
#> DNAStringSet object of length 7:
#> width seq
#> [1] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [2] 65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [3] 4 CAAT
#> [4] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [5] 69 ATCAAGTGGGGCGATGCTGGCGCTGAGTACGTCGTGGAGTCCACTGGCGTCTTCACCACCATGGAGAAG
#> [6] 65 GATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAG
#> [7] 4 CAAT
## confirms we have indeed retrieved incorrect coordinates for two transcripts
## with different translational start positions.
translate(sequence)
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[2]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[3]]': last base was ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[6]]': last 2 bases were ignored
#> Warning in .Call2("DNAStringSet_translate", x, skip_code,
#> dna_codes[codon_alphabet], : in 'x[[7]]': last base was ignored
#> AAStringSet object of length 7:
#> width seq
#> [1] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [2] 21 DAPMFVMGVNHEKYDNSLKII
#> [3] 1 Q
#> [4] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [5] 23 IKWGDAGAEYVVESTGVFTTMEK
#> [6] 21 DAPMFVMGVNHEKYDNSLKII
#> [7] 1 Q
Thanks for this very comprehensive and detailed summary @manogenome !
So, if I understand correctly, restricting genomic mapping using canonical transcript of a gene would solve this?
Yes, this would help us correctly retrieve the genome coordinate for the peptide as we use uniprot_id and peptide coordinates as inputs. We would still need genome coordinates for the peptide on all the isoforms. Now that we know the exon id(s) for the peptide on the canonical transcript, we can simply assign (instead of recalculating) the genome coordinates to all the transcript isoforms that share this peptide's exon id(s) and drop the other isoforms. This would require some validation. But first, we will need the canonical tag (Ensembl version >= 104).
Once you have genomic coordinates you could also use the genomeToTranscript
to get it's position within all transcripts (although that will not consider or test the coding region!).
I have now a version of a EnsDb
database for Homo sapiens with an additional column tx_is_canonical
for the transcript table. Unfortunately, the example from above does not work for Ensembl release 105 because now there is no "P04406"
Uniprot ID available, but two different ones: "P04406-1"
(returns correct coordinates/peptide sequence) and "P04406-2"
(returns the 2 wrong/different peptide sequences).
With the update it would be e.g. possible to filter the EnsDb
database to only canonical transcripts and then perform the mapping only with that. Would that be a working solution for you? Note that this would then only work with Ensembl release 104 and 105 but not earlier versions. Before re-building all EnsDb
databases for these two releases I would like you to test the new functionality to see if it would be useful at all. @manogenome , please let me know and I can provide you with the updated EnsDb
(for human) and the updated ensembldb
package.
Thanks, @jorainer for adding this tag. You're right this tag will only work for newer Ensembl releases >= 104. I just cross-checked the hyperlinks for P04406-1 and P04406-2 in the below table. We can see P04406-1
is marked as canonical and matches three isoforms. And P04406-2
matches two isoforms.
I also went through the Uniprot reviewed fasta file for humans. It only had a P04406
tag for the canonical sequence. It's possible Uniprot will update this tag to P04406-1
later on. I'm happy to test the usefulness of this tag. You can share the updated EnsDb
(for humans) and the ensembldb
package with me for testing this functionality when it's ready.
You can install the updated version of ensembldb
with BiocManager::install("jorainer/ensembldb")
.
The EnsDb
with the updated annotation is availeble here (the one for Ensembl 105).
You could now restrict/subset the whole EnsDb
database to only canonical transcript with:
edb <- filter(edb, filter = ~ tx_is_canonical == 1)
Thank you, @jorainer !! The updated package, database and the new tx_is_canonical
filter are working fine. Now, I'll look at the isoform coordinates issue and get back to you.
Thanks @jorainer - will that flag be available on the database itself in the package?
The annotation is available as an additional column to the transcript database table ("tx_is_canonical"
). All EnsDb
databases for Ensembl release 105 currently available in AnnotationHub
have this additional column.
ensembldb
in addition provides a new filter (TxIsCanonicalFilter
) that allows to filter/subset an EnsDb
database by that information.
After running
proteinToGenome
, I'm observing a case where a peptide, arising from a gene with multiple isoforms, is found to have a mix of within exon and junction coordinates. Upon closer inspection, some of those coordinates turned out to be incorrect.To illustrate the issue, I calculated the genome coordinates for the peptide, "IKWGDAGAEYVVESTGVFTTMEK", from the protein, P04006 (UniProt id). This returned seven genomic ranges -- three as within exon ranges on three transcripts and the remaining four as belonging to a junction on two transcripts.
Upon extracting the genomic sequence corresponding to the above genomic ranges for the peptide and translating it, we observe that only the three within exon ranges returned the actual peptide, "IKWGDAGAEYVVESTGVFTTMEK". The junction ranges don't match the supplied peptide.
Manually inspecting IRanges returned by
proteinToTranscript
showed that the junction ranges were shifted from the actual ranges and are incorrect. I suspect the bug to be hiding somewhere in this step. Below I've added a reproducible code to demonstrate the issue using this peptide as an example.