sgibb / cleaver

Cleavage of polypeptide sequences
http://sgibb.github.io/cleaver/

trypsin protease and PLGS cleavage rules #4

Closed pavel-shliaha closed 10 years ago

pavel-shliaha commented 10 years ago

It has previously been demonstrated that trypsin has digestion problems, depending on the amino acids in the vicinity of the K|R cleavage site:

1) P: trypsin is incapable of cutting if P is in the +1 position. Arguable (ref1).
2) Acidic amino acids in the vicinity of the cleavage site inhibit trypsin activity, but not to 100% (ref2).
3) Trypsin cannot cut if the cleavage site is flanked by 1 amino acid. As a result, a staggered end will result in two dead-end products, i.e. XXXXBBYYYY will result in both XXXXBB + YYYY and XXXXB + BYYYY (ref3).
4) Trypsin is reported to have little capacity as a dipeptidyl peptidase, but can function as a peptidyl dipeptidase. This means it can cut when 2 amino acids are present on the C terminus, but cannot when two amino acids are present on the N terminus. Hence XBYYYYY is a dead-end product, but XXXXXBY appears to be cut again to yield XXXXX (ref3).

The PLGS takes into consideration some of these rules and allows missed-cleavage peptides to be pepFrag1 or pepFrag2 (i.e. suitable for quantitation) if the cleavage site is followed by P, K, R, D or E. Note the rule only applies to the amino acid directly following K|R, although D and E seem to have an inhibitory effect on trypsin activity at positions -3:+3.

My suggestion is to allow the peptides with "special" missed cleavages for quantitation, i.e. when creating a vector of proteotypic peptides, both peptides with sequence XXXXK and XXXXKEXXXR should be present.

references:

  1. Does trypsin cut before proline?
  2. Large-Scale Quantitative Assessment of Different In-Solution Protein Digestion Protocols Reveals Superior Cleavage Efficiency of Tandem Lys-C/Trypsin Proteolysis over Trypsin Digestion.
  3. The importance of the digest: proteolysis and absolute quantification in proteomics. http://dx.doi.org/10.1016/j.ymeth.2011.05.005
pavel-shliaha commented 10 years ago

distribution of amino acids following the K|R in the cleavage site in PLGS output

D E K P R
216 208 167 314 68

distribution of amino acids preceding the K|R in the cleavage site in PLGS output (no apparent consensus)

A C D E F G H I K L M N P Q R S T V W Y
55 9 44 65 60 98 2 84 13 131 25 39 47 37 12 89 51 72 13 27
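
A tabulation like the ones above could be sketched in base R as follows. This is only an illustration: `followingResidue` is a hypothetical helper (not part of cleaver or PLGS), and the peptides are toy sequences, not the PLGS output the counts above were taken from.

```r
## Hypothetical helper: count the residue that directly follows each
## internal K or R (i.e. each missed cleavage site) in a vector of peptides.
followingResidue <- function(peptides) {
  ## "(?<=[KR])." matches the single character right after a K or R;
  ## a C-terminal K/R has no following residue and yields no match
  m <- regmatches(peptides, gregexpr("(?<=[KR]).", peptides, perl = TRUE))
  table(unlist(m))
}

## toy peptides, each with one missed cleavage site
followingResidue(c("XXXXKEXXR", "AAKPAAR", "GGRDGGK"))
```

Changing the pattern to `".(?=[KR].)"` should give the residue preceding each internal K|R instead.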
sgibb commented 10 years ago

On 2014-04-28 11:48:40, notifications@github.com wrote:

> 1) P; trypsin is incapable of cutting if P is in +1 position. Arguable (ref1)

cleaver knows this already:

cleave(c("XXXXKZYYYY", "XXXXKPYYYY"))
$XXXXKZYYYY
[1] "XXXXK" "ZYYYY"

$XXXXKPYYYY
[1] "XXXXKPYYYY"

> 2) acidic amino acids in the vicinity of cleavage site inhibit trypsin activity but not to 100%. (ref2)

Ok, I think we could address this with the missedCleavages argument, e.g.:

cleave(c("XXXXKZYYYY"), missedCleavages=0:1)
$XXXXKZYYYY
[1] "XXXXK"      "ZYYYY"      "XXXXKZYYYY"
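
For illustration, the way missed cleavages extend the result can be sketched in base R: for each allowed number of missed cleavages k, every run of k + 1 neighbouring fragments is joined. `joinMissed` is a hypothetical helper written for this sketch; it is not cleaver's actual implementation.

```r
## Hypothetical helper: given the fully cleaved fragments of one sequence
## (in order), return all peptides with the allowed numbers of missed
## cleavages by joining k + 1 consecutive fragments.
joinMissed <- function(fragments, missedCleavages = 0:1) {
  n <- length(fragments)
  unlist(lapply(missedCleavages, function(k) {
    if (k >= n) {
      return(character(0L))   # not enough fragments to skip k sites
    }
    vapply(seq_len(n - k), function(i) {
      paste(fragments[i:(i + k)], collapse = "")
    }, character(1L))
  }))
}

joinMissed(c("XXXXK", "ZYYYY"), missedCleavages = 0:1)
## [1] "XXXXK"      "ZYYYY"      "XXXXKZYYYY"
```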

> 3) trypsin cannot cut if the cleavage site is flanked by 1 amino acid. As a result staggered end will result in two dead end products, i.e. XXXXBBYYYY will result in both XXXXBB + YYYY and XXXXB + BYYYY. (ref3)

We can solve this in two ways:

  1. using missedCleavages
  2. change the trypsin rule, but in that case cleaver's rules will differ from its archetype, PeptideCutter
## solution 1
cleave(c("XXXXKKYYYY"), missedCleavages=0:1)
$XXXXKKYYYY
[1] "XXXXK"  "K"      "YYYY"   "XXXXKK" "KYYYY"
## solution 2
cleave(c("XXXXKKYYYY"), custom="K(?=[^P]).")
$XXXXKKYYYY
[1] "XXXXK" "KYYYY"

@pavel-shliaha which solution do you prefer?

> 4) trypsin is reported to have little capacity as a dipeptidyl peptidase, but can function as a peptidyl dipeptidase. It means it can cut when 2 amino acids are present on C terminus, but cannot when two amino acids are present on N terminus. Hence XBYYYYY is dead-end product, but XXXXXBY appears to be cut again to yield XXXXX (ref3)

cleave(c("XKYYYY", "XXXXKY"), missedCleavages=0:1)
$XKYYYY
[1] "XK"     "YYYY"   "XKYYYY"

$XXXXKY
[1] "XXXXK"  "Y"      "XXXXKY"

> The PLGS takes into consideration some of these rules and allows missed cleaved peptides to be pepFrag1 or pepFrag2 (i.e. suitable for quantitation), if the cleavage site is followed by P, K, R, D, E. Note the rule only applies to the amino acid directly following K|R, although D and E seem to have an inhibitory effect on trypsin activity at positions -3:+3.
>
> my suggestion is to allow the peptides with "special" missed cleavages for quantitation. I.e. when creating a vector of proteotypic peptides both peptides with sequence XXXXK and XXXXKEXXXR should be present.

cleave("XXXXKEXXR", missedCleavages=0:1)
$XXXXKEXXR
[1] "XXXXK"     "EXXR"      "XXXXKEXXR"
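
If only the "special" missed cleavages should be kept (rather than every missed cleavage, as `missedCleavages=0:1` returns), one possible approach is sketched below in base R. `specialMissedCleavages` and its `inhibiting` argument are hypothetical names for this sketch and not part of cleaver. It assumes a simplified trypsin rule (cleave after K or R unless followed by P), so a site followed by P is never cut in the first place and P does not need to be in the list.

```r
## Hypothetical helper: digest one sequence with a simplified trypsin rule
## and add single missed cleavages only where the skipped K/R is directly
## followed by one of `inhibiting`.
specialMissedCleavages <- function(sequence,
                                   inhibiting = c("K", "R", "D", "E")) {
  aa <- strsplit(sequence, "", fixed = TRUE)[[1L]]
  n <- length(aa)
  ## simplified rule: cleave after K or R unless the next residue is P
  isSite <- aa %in% c("K", "R") & c(aa[-1L], "") != "P"
  sites <- which(isSite[-n])        # a C-terminal K/R creates no new cut
  start <- c(1L, sites + 1L)
  end <- c(sites, n)
  full <- substring(sequence, start, end)
  ## join fragments i and i + 1 if the residue after site i is inhibiting
  idx <- which(aa[sites + 1L] %in% inhibiting)
  missed <- if (length(idx)) {
    substring(sequence, start[idx], end[idx + 1L])
  } else {
    character(0L)
  }
  c(full, missed)
}

specialMissedCleavages("XXXXKEXXR")
## [1] "XXXXK"     "EXXR"      "XXXXKEXXR"
specialMissedCleavages("XXXXKZYYYY")
## [1] "XXXXK" "ZYYYY"
```

With this filter XXXXKZYYYY yields no missed-cleavage peptide, because Z is not in the inhibiting set.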

And of course taking mis-cleavages into consideration was slow (@pavel-shliaha thanks for finding it!). I just fixed the calculation of mis-cleavages (removing a lot of possibilities that were calculated twice) and now it is fast, even with missedCleavages=0:7 :wink:

system.time(cleave(readAAStringSet("cleaverbug/bug/S.cerevisiae_Uniprot_reference_canonical_18_03_14.fasta"), missedCleavages=0:7))
   user  system elapsed
 32.250   0.024  32.367

Please note that this version is not on Bioconductor yet (I have to backport the fix).

Best wishes,

Sebastian



sgibb commented 10 years ago

see https://github.com/lgatto/synapter/pull/57