Closed lgatto closed 2 years ago
The error is due to Q1MSJ5
. There exist transcripts for this UniProt identifier:
> ensembldb::transcripts(edb, filter = AnnotationFilter::UniprotFilter("Q1MSJ5"))
GRanges object with 2 ranges and 9 metadata columns:
seqnames ranges strand | tx_id
<Rle> <IRanges> <Rle> | <character>
ENST00000262210 8 67064368-67196263 + | ENST00000262210
ENST00000519668 8 67083848-67195611 + | ENST00000519668
tx_biotype tx_cds_seq_start tx_cds_seq_end
<character> <integer> <integer>
ENST00000262210 protein_coding 67064399 67195593
ENST00000519668 protein_coding 67095665 67195593
gene_id tx_support_level tx_id_version
<character> <integer> <character>
ENST00000262210 ENSG00000104218 1 ENST00000262210.9
ENST00000519668 ENSG00000104218 1 ENST00000519668.1
tx_name uniprot_id
<character> <character>
ENST00000262210 ENST00000262210 Q1MSJ5
ENST00000519668 ENST00000519668 Q1MSJ5
-------
seqinfo: 1 sequence from GRCh38 genome
But some of the ranges must be out of bounds:
> i <- which(x$names == "Q1MSJ5")
> ir <- IRanges(start = x$start[i], end = x$end[i], names = x$names[i])
> ir
IRanges object with 381 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
Q1MSJ5 1 15 15
Q1MSJ5 16 26 11
Q1MSJ5 31 36 6
Q1MSJ5 37 49 13
Q1MSJ5 57 69 13
... ... ... ...
Q1MSJ5 1137 1168 32
Q1MSJ5 1156 1172 17
Q1MSJ5 1167 1174 8
Q1MSJ5 1169 1175 7
Q1MSJ5 1173 1200 28
> ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id")
Fetching CDS for 381 proteins ... 377 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir, edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5, Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
4: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
5: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
I'll dig deeper and report back.
So this is actually rather weird. Elements 69 and 70 are empty (which leads to the out of bounds error - see below for details), but the error only happens when they are with a non-empty element:
> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
> ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[69], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id")
Fetching CDS for 1 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5
> ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id")
Fetching CDS for 2 proteins ... 0 found
Checking CDS and protein sequence lengths ... 0/0 OK
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning message:
In ensembldb::proteinToGenome(ir[69:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
> ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[69:71], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
So back to the error, that is thrown here:
are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList"))
all(z[[1]]$cds_ok) ## error when z has no ranges
else NA
}))
Below, I debug the code and z is element 69
Browse[1]> debug at /tmp/proteinToX.R!1gB7hU#20: are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList"))
all(z[[1]]$cds_ok)
else NA
}))
Browse[2]> z <- cds_genome[[2]]
Browse[2]> z
GRangesList object of length 0:
<0 elements>
Browse[2]> is(z, "GRangesList")
[1] TRUE
Browse[2]> z[[1]]
Error in normalizeDoubleBracketSubscript(i, x, exact = exact, allow.NA = TRUE, :
subscript is out of bounds
This can be fixed with
are_ok <- unlist(lapply(cds_genome, function(z) {
if (is(z, "GRangesList") & length(z) > 0)
all(z[[1]]$cds_ok)
else NA
}))
which will come as a PR in a moment (plus another minor issue related that). But it's not over...
> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
Error in `[[<-`(`*tmp*`, name, value = 1201L) :
1 elements in value to replace 0 elements
In addition: Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
Will continue to investigate and report.
Coming back to the oddity above, as to why the code worked with only empty GRangesList
elements... that because the cds_genome
list would be made of NULL
s only:
cds_genome
$Q1MSJ5
NULL
$Q1MSJ5
NULL
which would, as anticipated, graciously produce
are_ok
Q1MSJ5 Q1MSJ5
NA NA
Continuing my stroll along the proteinToGenome()
path, the new errors happens here:
res <- mapply(
cds_genome, as(coords_cds, "IRangesList"), as(x, "IRangesList"),
FUN = function(gnm, cds, prt) {
if (is.null(gnm)) {
GRanges()
} else {
## Unlist because we'd like to have a GRanges here. Will split
## again later.
maps <- unlist(.to_genome(gnm, cds))
## Don't want to have GRanges names!
names(maps) <- NULL
mcols(maps)$protein_start <- start(prt) ## ERROR
mcols(maps)$protein_end <- end(prt) ## ERROR
maps[order(maps$exon_rank)]
}
})
Given that we have an empty GRangesList
in cds_genome
and gnm
in the mapply()
call, maps
also becomes an empty GRanges
and trying to set any mcols
produces the (now obvious)
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
This can be fixes with an additional condition in the test (PR to follow):
...
if (is.null(gnm) | !length(gnm)) {
GRanges()
} else {
...
> ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id")
Fetching CDS for 4 proteins ... 2 found
Checking CDS and protein sequence lengths ... 2/4 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67195453-67195527 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00002101366 29 TRUE 1176
protein_end
<integer>
[1] 1200
-------
seqinfo: 1 sequence from GRCh38 genome
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRangesList object of length 2:
$ENSP00000262210
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67064399-67064476 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00003513069 1 TRUE 1
protein_end
<integer>
[1] 26
-------
seqinfo: 1 sequence from GRCh38 genome
$ENSP00000430092
GRanges object with 2 ranges and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67095665-67095732 + | Q1MSJ5 ENST00000519668
[2] 8 67103037-67103046 + | Q1MSJ5 ENST00000519668
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000430092 ENSE00003677216 4 TRUE 1
[2] ENSP00000430092 ENSE00001284106 5 TRUE 1
protein_end
<integer>
[1] 26
[2] 26
-------
seqinfo: 1 sequence from GRCh38 genome
Warning messages:
1: In ensembldb::proteinToGenome(ir[68:71], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
> ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id")
Fetching CDS for 3 proteins ... 1 found
Checking CDS and protein sequence lengths ... 1/3 OK
$Q1MSJ5
GRanges object with 1 range and 8 metadata columns:
seqnames ranges strand | uniprot_id tx_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 8 67195453-67195527 + | Q1MSJ5 ENST00000262210
protein_id exon_id exon_rank cds_ok protein_start
<character> <character> <integer> <logical> <integer>
[1] ENSP00000262210 ENSE00002101366 29 TRUE 1176
protein_end
<integer>
[1] 1200
-------
seqinfo: 1 sequence from GRCh38 genome
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
$Q1MSJ5
GRanges object with 0 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
-------
seqinfo: no sequences
Warning messages:
1: In ensembldb::proteinToGenome(ir[68:70], edb, idType = "uniprot_id") :
No CDS found for: Q1MSJ5, Q1MSJ5
2: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
3: Could not find a CDS whith the expected length for protein: 'Q1MSJ5'. The returned genomic coordinates might thus not be correct for this protein.
What I'm worried about is that the originally empty GRanges[List]
elements aren't consistent with the others, given that they miss the mcols
, but I assume this is something that should be addressed downstream, not here.
Thanks for the detailed report @lgatto ! The main reason seems to me that there's an issue between the CDS of the transcripts and the protein sequence. That is something I have already observed and that I can not really explain. This is also the main reason for this additional check I'm performing that compares the length of the CDS with the length of the protein sequence - if they don't match you'll get the Could not find a CDS whith the expected length for protein warnings - and in this cases it can well be that the peptide sequence is within the protein sequence but, because the CDS does not match the protein sequence, outside of the CDS (and the CDS is used to map coordinates to the genome).
Note that there are two different types of Uniprot-to-Ensembl protein mappings (info provided by Ensembl):
> listUniprotMappingTypes(edb)
[1] "DIRECT" "SEQUENCE_MATCH"
Maybe worth to also try filtering the EnsDb
on one of those to see if it reduces the number of mismatches between CDS and protein sequences (or some of the problems)?
edb <- filter(edb, filter = ~ uniprot_mapping_type == "DIRECT")
There's also the issue documented in #123 where some of the peptide coordinates along the UniProt proteins might actually not be mapped to the ENSEMBL canonical transcripts.
Thanks for the uniprot mapping types hint!
I get the following error where start/end/names indicate start/end ranges of peptides along UnitProt identifiers (the same as discussed over slack yesterday) and
edb
isIndeed, I confirm the first warning:
Could you help with the error, @jorainer