sgibb / cleaver

Cleavage of polypeptide sequences
http://sgibb.github.io/cleaver/
12 stars 3 forks source link

In some cleaved peptides, number of cleavage sites > number of defined missed cleavages #7

Closed manogenome closed 3 years ago

manogenome commented 3 years ago

When we digest proteins with a certain number of missed cleavages (0:M), the maximum number of cleavage sites per peptide is expected to be in the ranges (0, M). But for a certain number of peptides, the number of cleavage sites in it exceed the missedCleavages value specified in the initial digestion.

In the below example case, we can see there are 78 peptides that have more than 2 cleavage sites, even though the allowed number of missed cleavages was defined as missedCleavages=0:2 during trypsin digestion.

Test proteins fasta: proteins.fasta.gz

library(cleaver)

## read fasta
proteins <- readAAStringSet("proteins.fasta.gz")

## number of proteins in proteins.fasta
length(proteins)
## [1] 38

## digest proteins with trypsin
cleaved <- cleaver::cleave(proteins, missedCleavages = 0:2, enzym = "trypsin")

## unlist into AAStringSet
peptides <- unlist(cleaved)

## rename individual peptides as: id::peptide
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2], 
                          "::", as.character(peptides))

## get cleaved sites within peptides
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")

## number of peptides with cleavage sites > 2
length(missed[elementNROWS(missed) > 2])
## [1] 78

## peptides with more with cleavage sites > 2
head(missed[elementNROWS(missed) > 2])
## $`A6NL46::RRKK`
[1] 1 2 3

$`A6NL46::RRKK`
[1] 1 2 3

$`A6NL46::RRAVSMDNGAKFLR`
[1]  1  2 11

$`A6NL46::RRPMIYVESSEESSDEQPDEVESPTQSQDSTPAEEREDEGASAAQGQEPEADSQELVQPKTGCELGDGPDTK`
[1]  1 36 60

$`A6NL46::RRQEGKCK`
[1] 1 2 6

$`A6NL46::RRGSSIPQFTNSPTMVIMVGLPARGK`
[1]  1  2 24

And there's also a mismatch between the number of ranges and peptides after enzymatic digestion:

cleaved <- cleaver::cleave(proteins, missedCleavages = 0, enzym = "trypsin")
ranges <- cleaver::cleavageRanges(proteins, missedCleavages = 0, enzym = "trypsin")
sites <- cleaver::cleavageSites(proteins, enzym = "trypsin")

sum(lengths(cleaved))
## [1] 17072
sum(lengths(ranges) )
## [1] 23260
sum(lengths(sites))
## [1] 23222 
sum(lengths(sites)) + length(proteins)
## [1] 23260

peptides <- unlist(cleaved)
names(peptides) <- paste0(base::strsplit(names(cleaved), "\\|")[[1]][2], 
                          "::", as.character(peptides))
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")
length(missed[elementNROWS(missed) > 0])
## 55
sgibb commented 3 years ago

Thanks for your report and your great example. It was really a though task.

First I like to start with the easy one, your second question. The reason for the mismatch between the number of ranges and peptides after enzymatic digestion is the argument unique = TRUE. Because of that all duplicated cleavage results are filtered, e.g.

library("cleaver")

cleave("RRK")
#> $RRK
#> [1] "R" "K"
cleave("RRK", unique = TRUE)
#> $RRK
#> [1] "R" "K"
cleave("RRK", unique = FALSE)
#> $RRK
#> [1] "R" "R" "K"

Created on 2021-09-08 by the reprex package (v2.0.1)

With unique = FALSE we get the numbers you were expecting:

library("cleaver")

proteins <- readAAStringSet("proteins.fasta.gz")

cleaved <- cleaver::cleave(proteins, missedCleavages = 0, enzym = "trypsin", unique = FALSE)
ranges <- cleaver::cleavageRanges(proteins, missedCleavages = 0, enzym = "trypsin")
sites <- cleaver::cleavageSites(proteins, enzym = "trypsin")

sum(lengths(cleaved))
#> [1] 23260
sum(lengths(ranges))
#> [1] 23260
sum(lengths(sites))
#> [1] 23222
sum(lengths(sites)) + length(proteins)
#> [1] 23260

Created on 2021-09-08 by the reprex package (v2.0.1)


Your first problem is the result of one of trypsin's cleavage exception rules: (see https://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html)

Enzyme name P4 P3 P2 P1 P1' P2'
Trypsin - - R R H or R -

A minimal example would be the sequence RRRK:

library("cleaver")

cleave("RRRK", unique = FALSE)
#> $RRRK
#> [1] "R"  "RR" "K"

cleavageSites("RRRK")
#> $RRRK
#> [1] 1 3
# instead of 1, 2, 3
cleavageSites(c("R", "RR", "K"))
#> $R
#> integer(0)
#> 
#> $RR
#> [1] 1
#> 
#> $K
#> integer(0)

Created on 2021-09-08 by the reprex package (v2.0.1)

Because of the exception rule RRR is not cleaved but after removing R from the start, RR doesn't fulfill the exception anymore and could be cleaved again.

This exception is responsible for all your reported cleavage sites with more than 2 missed cleavages:

library("cleaver")

## read fasta (I filtered duplicated proteins for simplicity)
proteins <- unique(readAAStringSet("proteins.fasta.gz"))

## number of proteins in proteins.fasta
length(proteins)
#> [1] 28

## digest proteins with trypsin
cleaved <- cleaver::cleave(proteins, missedCleavages = 0:2, enzym = "trypsin")

## unlist into AAStringSet
peptides <- unlist(cleaved)

## rename individual peptides as: id::peptide
## I changed your strsplit approach because it doesn't name the peptides
## correctly.
names(peptides) <- paste(
    sub("^sp\\|([^|]+)\\|.*$", "\\1", names(peptides)),
    as.character(peptides),
    sep = "::"
)

## get cleaved sites within peptides
missed <- cleaver::cleavageSites(peptides, enzym = "trypsin")

## number of peptides with cleavage sites > 2
length(missed[elementNROWS(missed) > 2])
#> [1] 38

## which peptides have cleavage site > 2
i <- which(elementNROWS(missed) > 2)

## get AA after cleavage site
post <- as.character(peptides[i])

## trypsin exception 4: RR|R or RR|H
post <- paste0("R", post)
nms <- sub("::.*", "", names(peptides)[i])

library("stringr")
library("crayon")

m <- logical(length(post))
s <- character(length(post))

for (i in seq_along(post)) {
    prot <- as.character(proteins[[grep(nms[i], names(proteins))]])
    m[i] <- grepl(post[i], prot)
    ## replace peptides with highlighted ones (just works in the console)
    s[i] <- str_replace_all(prot, post[i], red(bold(underline(post[i]))))
    cat(nms[i], ": \n", s[i], "\n---\n\n")
}
#> A6NL46 : 
#>  MRLCLIPWNTTPHRVLPPVVWSAPSRKKPVLSARNSMMFGHLSPVRNPRLRGKFNLQLPSLDEQVIPTRLPKMEVRAEEPKEATEVKDQVETQGQEDNKTGPCSNGKAASTSRPLETQGNLTSSWYNPRPLEGNVHLKSLTEKNQTDKAQVHAVSFYSKGHGVTSSHSPAGGILPFGKPDPLPAVLPAPVPDCSLWPEKAALKVLGKDHLPSSPGLLMVGEDMQPKDPAALRSSRSSPPRAAGHRPRKRKLSGPPLQLQQTPPLQLRWDRDEGPPPAKLPCLSPEALLVGKASQREGRLQQGNMRKNVRVLSRTSKFRRLRQLLRRRKKRWQGRRGGSRL 
#> ---
#> 
#> A8MUI8 : 
#>  MRLCLIPQNTGTPQRVLPPVVWSPPSRKKPMLSACNSMMFGHLSPVRIPHLRGKFNLQLPSLDEQVIPARLPKMEVRAEEPKEATEVKDQVETQGQEDNKRGPCSNGEAASTSSLLETQGNLTSSWYNPRPLEGNVHLKSLIEKNQTDKAQVHAVSFYSKDHEVASSHSPAGGILSFGKPDPLPTVLPAPVPGCSLWPEKAALKVLGKDHLPSSPGLLMVGEDMQPKDPAALGSSRSSPSRAASHSSHKRKLSEPPLQLQPTPPLQLKWDRDEGPPPAKFPCLSPEALLVSQASQREGRLQQGNMCKNMRVLSRTSKFRRLRELLRRRKKRRQGRCGSSHL 
#> ---
#> ... 

## How often we found the RRR or RRH motive:
sum(m)
#> [1] 38
mean(m)
#> [1] 1

Created on 2021-09-08 by the reprex package (v2.0.1)

If you don't like these exceptions you could set your own rules via the custom argument:

library("cleaver")

cleave("RRRK", custom = cleaver:::rules["trypsin"], unique = FALSE)
#> $RRRK
#> [1] "R" "R" "R" "K"

Created on 2021-09-08 by the reprex package (v2.0.1)

Finally I don't think your examples are bugs but (maybe unexpected but) correct behaviour.

Feel free to reopen this issue if this doesn't solve your problem or you disagree!

lgatto commented 3 years ago

Thank you @sgibb for taking the time to explore this in detail!

lgatto commented 2 years ago

To add to this issue, here's a short example directly from UniProt with a protein that has a RRRP subsequence:

suppressPackageStartupMessages(library("Biostrings"))
library("cleaver")
f <- tempfile()
download.file("https://www.uniprot.org/uniprot/P05023.fasta", f)
p <- readAAStringSet(f)
p
#> AAStringSet object of length 1:
#>     width seq                                               names               
#> [1]  1023 MGKGVGRDKYEPAAVSEQGDKKG...YDEVRKLIIRRRPGGWVEKETYY sp|P05023|AT1A1_H...
subseq(p, 1010, 1023)
#> AAStringSet object of length 1:
#>     width seq                                               names               
#> [1]    14 RRRPGGWVEKETYY                                    sp|P05023|AT1A1_H...
cleave("RRRP", unique = FALSE)
#> $RRRP
#> [1] "R"   "RRP"
cleave(c("R", "RRP"), unique = FALSE)
#> $R
#> [1] "R"
#> 
#> $RRP
#> [1] "R"  "RP"

Created on 2021-12-29 by the reprex package (v2.0.1)