Closed manogenome closed 3 years ago
Thanks for your report and your great example. It was really a though task.
First I like to start with the easy one, your second question. The reason for the mismatch between the number of ranges and peptides after enzymatic digestion is the argument unique = TRUE
. Because of that all duplicated cleavage results are filtered, e.g.
library("cleaver")
cleave("RRK")
#> $RRK
#> [1] "R" "K"
cleave("RRK", unique = TRUE)
#> $RRK
#> [1] "R" "K"
cleave("RRK", unique = FALSE)
#> $RRK
#> [1] "R" "R" "K"
Created on 2021-09-08 by the reprex package (v2.0.1)
With unique = FALSE
we get the numbers you were expecting:
library("cleaver")
proteins <- readAAStringSet("proteins.fasta.gz")
cleaved <- cleaver::cleave(proteins, missedCleavages = 0, enzym = "trypsin", unique = FALSE)
ranges <- cleaver::cleavageRanges(proteins, missedCleavages = 0, enzym = "trypsin")
sites <- cleaver::cleavageSites(proteins, enzym = "trypsin")
sum(lengths(cleaved))
#> [1] 23260
sum(lengths(ranges))
#> [1] 23260
sum(lengths(sites))
#> [1] 23222
sum(lengths(sites)) + length(proteins)
#> [1] 23260
Created on 2021-09-08 by the reprex package (v2.0.1)
Your first problem is the result of one of trypsin's cleavage exception rules: (see https://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html)
Enzyme name | P4 | P3 | P2 | P1 | P1' | P2' |
---|---|---|---|---|---|---|
Trypsin | - | - | R | R | H or R | - |
A minimal example would be the sequence RRRK
:
library("cleaver")
cleave("RRRK", unique = FALSE)
#> $RRRK
#> [1] "R" "RR" "K"
cleavageSites("RRRK")
#> $RRRK
#> [1] 1 3
# instead of 1, 2, 3
cleavageSites(c("R", "RR", "K"))
#> $R
#> integer(0)
#>
#> $RR
#> [1] 1
#>
#> $K
#> integer(0)
Created on 2021-09-08 by the reprex package (v2.0.1)
Because of the exception rule RRR
is not cleaved but after removing R
from the start, RR
doesn't fulfill the exception anymore and could be cleaved again.
This exception is responsible for all your reported cleavage sites with more than 2 missed cleavages:
library("cleaver")
## read fasta (I filtered duplicated proteins for simplicity)
proteins <- unique(readAAStringSet("proteins.fasta.gz"))
## number of proteins in proteins.fasta
length(proteins)
#> [1] 28
## digest proteins with trypsin
cleaved <- cleaver::cleave(proteins, missedCleavages = 0:2, enzym = "trypsin")
## unlist into AAStringSet
peptides <- unlist(cleaved)
## rename individual peptides as: id::peptide
## I changed your strsplit approach because it doesn't name the peptides
## correctly.
names(peptides) <- paste(
sub("^sp\\|([^|]+)\\|.*$", "\\1", names(peptides)),
as.character(peptides),
sep = "::"
)
## get cleaved sites within peptides
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")
## number of peptides with cleavage sites > 2
length(missed[elementNROWS(missed) > 2])
#> [1] 38
## which peptides have cleavage site > 2
i <- which(elementNROWS(missed) > 2)
## get AA after cleavage site
post <- as.character(peptides[i])
## trypsin exception 4: RR|R or RR|H
post <- paste0("R", post)
nms <- sub("::.*", "", names(peptides)[i])
library("stringr")
library("crayon")
m <- logical(length(post))
s <- character(length(post))
for (i in seq_along(post)) {
prot <- as.character(proteins[[grep(nms[i], names(proteins))]])
m[i] <- grepl(post[i], prot)
## replace peptides with highlighted ones (just works in the console)
s[i] <- str_replace_all(prot, post[i], red(bold(underline(post[i]))))
cat(nms[i], ": \n", s[i], "\n---\n\n")
}
#> A6NL46 :
#> MRLCLIPWNTTPHRVLPPVVWSAPSRKKPVLSARNSMMFGHLSPVRNPRLRGKFNLQLPSLDEQVIPTRLPKMEVRAEEPKEATEVKDQVETQGQEDNKTGPCSNGKAASTSRPLETQGNLTSSWYNPRPLEGNVHLKSLTEKNQTDKAQVHAVSFYSKGHGVTSSHSPAGGILPFGKPDPLPAVLPAPVPDCSLWPEKAALKVLGKDHLPSSPGLLMVGEDMQPKDPAALRSSRSSPPRAAGHRPRKRKLSGPPLQLQQTPPLQLRWDRDEGPPPAKLPCLSPEALLVGKASQREGRLQQGNMRKNVRVLSRTSKFRRLRQLLRRRKKRWQGRRGGSRL
#> ---
#>
#> A8MUI8 :
#> MRLCLIPQNTGTPQRVLPPVVWSPPSRKKPMLSACNSMMFGHLSPVRIPHLRGKFNLQLPSLDEQVIPARLPKMEVRAEEPKEATEVKDQVETQGQEDNKRGPCSNGEAASTSSLLETQGNLTSSWYNPRPLEGNVHLKSLIEKNQTDKAQVHAVSFYSKDHEVASSHSPAGGILSFGKPDPLPTVLPAPVPGCSLWPEKAALKVLGKDHLPSSPGLLMVGEDMQPKDPAALGSSRSSPSRAASHSSHKRKLSEPPLQLQPTPPLQLKWDRDEGPPPAKFPCLSPEALLVSQASQREGRLQQGNMCKNMRVLSRTSKFRRLRELLRRRKKRRQGRCGSSHL
#> ---
#> ...
## How often we found the RRR or RRH motive:
sum(m)
#> [1] 38
mean(m)
#> [1] 1
Created on 2021-09-08 by the reprex package (v2.0.1)
If you don't like these exceptions you could set your own rules via the custom
argument:
library("cleaver")
cleave("RRRK", custom = cleaver:::rules["trypsin"], unique = FALSE)
#> $RRRK
#> [1] "R" "R" "R" "K"
Created on 2021-09-08 by the reprex package (v2.0.1)
Finally I don't think your examples are bugs but (maybe unexpected but) correct behaviour.
Feel free to reopen this issue if this doesn't solve your problem or you disagree!
Thank you @sgibb for taking the time to explore this in detail!
To add to this issue, here's a short example directly from UniProt with a protein that has a RRRP
subsequence:
suppressPackageStartupMessages(library("Biostrings"))
library("cleaver")
f <- tempfile()
download.file("https://www.uniprot.org/uniprot/P05023.fasta", f)
p <- readAAStringSet(f)
p
#> AAStringSet object of length 1:
#> width seq names
#> [1] 1023 MGKGVGRDKYEPAAVSEQGDKKG...YDEVRKLIIRRRPGGWVEKETYY sp|P05023|AT1A1_H...
subseq(p, 1010, 1023)
#> AAStringSet object of length 1:
#> width seq names
#> [1] 14 RRRPGGWVEKETYY sp|P05023|AT1A1_H...
cleave("RRRP", unique = FALSE)
#> $RRRP
#> [1] "R" "RRP"
cleave(c("R", "RRP"), unique = FALSE)
#> $R
#> [1] "R"
#>
#> $RRP
#> [1] "R" "RP"
Created on 2021-12-29 by the reprex package (v2.0.1)
When we digest proteins with a certain number of missed cleavages (0:M), the maximum number of cleavage sites per peptide is expected to be in the ranges (0, M). But for a certain number of peptides, the number of cleavage sites in it exceed the
missedCleavages
value specified in the initial digestion.In the below example case, we can see there are 78 peptides that have more than 2 cleavage sites, even though the allowed number of missed cleavages was defined as
missedCleavages=0:2
duringtrypsin
digestion.Test proteins fasta: proteins.fasta.gz
And there's also a mismatch between the number of ranges and peptides after enzymatic digestion: