Specifying the enzyme rules used in the search

GoogleCodeExporter commented 8 years ago

General discussion about enzymes:

- The simple case rule is cleave at x, y or z unless the adjacent residue
is a, b or c
- An enzyme can cleave either side of a residue (normally known as cterm or
nterm)
- Most search engines allow the number of missed cleavages to be specified.
- Many search engines allow 'semi-specific' - i.e. one terminus must cleave
according to the rules, the other can cleave at any residue.
- Most search engines allow a 'no enzyme' option.
- Some search engines allow a minimum distance between cleavage sites
- Some search engines allow more than one enzyme to be specified.
- If more than one enzyme, they can be applied to separate aliquots which
are then mixed, or they can be applied 'together'. (If separate aliquots,
then a peptide cannot be cleaved at one terminus by one enzyme, and the
other by a different enzyme)
- There are no standard names or definitions
- There are more complex enzymes. For example CNBr(cyanogen bromide) is
unusual in that it cleaves on the C-terminal side of methionine, converting
it to a homoserine.
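
To make the "simple case rule" and the missed-cleavages bullet concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and writing the trypsin-like rule as a lookaround regex is an assumption borrowed from the regex approach discussed further down this thread; it is not how any particular search engine implements digestion.

```java
import java.util.ArrayList;
import java.util.List;

final class MissedCleavageDigest {

    // Split the protein at every cleavage site, then report every run of
    // up to (missedCleavages + 1) adjacent fragments as one peptide.
    static List<String> digest(String protein, String siteRegex, int missedCleavages) {
        String[] fragments = protein.split(siteRegex);
        List<String> peptides = new ArrayList<>();
        for (int start = 0; start < fragments.length; start++) {
            StringBuilder peptide = new StringBuilder();
            for (int end = start; end < fragments.length && end - start <= missedCleavages; end++) {
                peptide.append(fragments[end]);
                peptides.add(peptide.toString());
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        // Hypothetical rule: cleave after K or R unless the next residue is P.
        String trypsinLike = "(?<=[KR])(?!P)";
        // prints [AAKPCCR, AAKPCCRDDK, DDK, DDKEE, EE]
        System.out.println(digest("AAKPCCRDDKEE", trypsinLike, 1));
    }
}
```

With missedCleavages set to 0 the same call returns only the three fully cleaved fragments.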

Original issue reported on code.google.com by dcre...@gmail.com on 17 Jun 2008 at 4:40

GoogleCodeExporter commented 8 years ago

Comment from Sean Seymour, 07 May 2008 18:29:

Hi all,

I think having a standard/aligned way to indicate where cleavages are allowed or
forbidden is a great idea. Just don't make it required content - we don't do it like
that in some search modes. Alternatively, we could expand the syntax, but I think it
would be much easier not to and we'll push our settings out as Paragon-specific CV.

Sean 

and Pierre-Alain:
Hi all,
same might be true for Phenyx, where rules might be as complex as regular expressions
allow them to be...

Original comment by dcre...@gmail.com on 17 Jun 2008 at 4:43

Added labels: Milestone-Release1.0

GoogleCodeExporter commented 8 years ago

Example from Phenyx:
        <cleavEnzymes>
          <oneCleavEnzyme name="Trypsin_(KR_noP)" owner="default">
            <site>
              <cleavSite>KR</cleavSite>
              <adjacentSite>^P</adjacentSite>
              <terminus>C</terminus>
            </site>
            <CTermGain>OH</CTermGain>
            <NTermGain>H</NTermGain>
          </oneCleavEnzyme>

and using regex:
          <oneCleavEnzyme name="Trypsin_regexp">
            <siteRegexp><![CDATA[(?<=[KR])(?=[^P])]]></siteRegexp>
            <terminus>C</terminus>
            <CTermGain>OH</CTermGain>
            <NTermGain>H</NTermGain>
          </oneCleavEnzyme>

Original comment by dcre...@gmail.com on 17 Jun 2008 at 4:47

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Dual enzyme example from Mascot (not XML):

Title:LysC+AspN
Cleavage[0]:K
Restrict[0]:P
Cterm[0]
Cleavage[1]:DB
Nterm[1]
Independent:1

Mascot semi trypsin:
Title:semiTrypsin
Cleavage[0]:KR
Restrict[0]:P
Cterm[0]
SemiSpecific:1
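
A rough Java sketch of what 'semi-specific' means in the general discussion above (one terminus follows the rule, the other can be anywhere). This is only an illustration under stated assumptions: the names are hypothetical, missed cleavages are ignored, minLength is assumed to be at least 1, and Mascot's actual handling of SemiSpecific:1 may well differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class SemiSpecificDigest {

    // For each fully specific fragment, keep every prefix (rule-conforming
    // N-terminus, free C-terminus) and every suffix (free N-terminus,
    // rule-conforming C-terminus) of at least minLength residues.
    static Set<String> semiSpecific(String protein, String siteRegex, int minLength) {
        Set<String> peptides = new LinkedHashSet<>();
        for (String fragment : protein.split(siteRegex)) {
            int n = fragment.length();
            for (int end = minLength; end <= n; end++) {
                peptides.add(fragment.substring(0, end));
            }
            for (int start = 1; start <= n - minLength; start++) {
                peptides.add(fragment.substring(start));
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        // Cleavage KR, Restrict P, Cterm, written here as a lookaround regex.
        String semiTrypsin = "(?<=[KR])(?!P)";
        // prints [AC, ACK, CK, DE]
        System.out.println(semiSpecific("ACKDE", semiTrypsin, 2));
    }
}
```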

Original comment by dcre...@gmail.com on 17 Jun 2008 at 4:52

GoogleCodeExporter commented 8 years ago

Discussion on TPP:

http://groups.google.com/group/spctools-discuss/browse_thread/thread/d31ff28280f0e46e/7f240fdc6a9fd1a8?lnk=gst&q=This+looks+like+a+great#7f240fdc6a9fd1a8

"This looks like a great grant proposal for more robust enzyme handling."

i.e. we are not alone in thinking that we can solve this by a simple regex or two...

Original comment by dcre...@gmail.com on 17 Jun 2008 at 4:57

GoogleCodeExporter commented 8 years ago

I propose something simple and rather similar to the first Phenyx case:

<Enzyme name="Trypsin" semiSpecific="0" missedCleavages="2">
  <site>
    <cleaveSite>KR</cleaveSite>
    <noCleaveSite>P</noCleaveSite>
    <terminus>C</terminus>
  </site>
  <CTermGain>OH</CTermGain>
  <NTermGain>H</NTermGain>
</Enzyme>

<Enzyme name="LysC+AspN" semiSpecific="0" missedCleavages="1" independent="1">
  <site>
    <cleaveSite>K</cleaveSite>
    <noCleaveSite>P</noCleaveSite>
    <terminus>C</terminus>
  </site>
  <site>
    <cleaveSite>DB</cleaveSite>
    <noCleaveSite></noCleaveSite>
    <terminus>N</terminus>    
  </site>
  <CTermGain>OH</CTermGain>
  <NTermGain>H</NTermGain>
</Enzyme>

However, <site> shouldn't be required and we should be able to use CV instead for
cases that don't follow these simple rules.
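
As a sanity check that the three proposed values carry enough information, here is a small Java sketch that turns (cleaveSite, noCleaveSite, terminus) into a lookaround regex. The helper name is hypothetical, and for terminus N it assumes that noCleaveSite refers to the residue just before the cut, which the proposal does not actually spell out.

```java
import java.util.Arrays;

final class SiteRuleToRegex {

    // terminus 'C': cut after a cleaveSite residue unless the next residue is in noCleaveSite.
    // terminus 'N': cut before a cleaveSite residue unless the previous residue is in
    //               noCleaveSite (an assumption, see the note above).
    static String toRegex(String cleaveSite, String noCleaveSite, char terminus) {
        boolean cTerm = terminus == 'C';
        String site = (cTerm ? "(?<=[" : "(?=[") + cleaveSite + "])";
        if (noCleaveSite.isEmpty()) {
            return site;
        }
        String block = (cTerm ? "(?![" : "(?<![") + noCleaveSite + "])";
        return cTerm ? site + block : block + site;
    }

    public static void main(String[] args) {
        String trypsin = toRegex("KR", "P", 'C');
        String aspN = toRegex("DB", "", 'N');
        System.out.println(trypsin); // prints (?<=[KR])(?![P])
        System.out.println(aspN);    // prints (?=[DB])
        // prints [AAKPCCR, DDK, EE]
        System.out.println(Arrays.toString("AAKPCCRDDKEE".split(trypsin)));
    }
}
```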

Original comment by dcre...@gmail.com on 17 Jun 2008 at 5:12

GoogleCodeExporter commented 8 years ago

probably just a nomenclature thing, but I've always felt the <cleaveSite>
<noCleaveSite> combined with <terminus> is a bit confusing. The naming scheme
used by biochemists is the Schechter and Berger notation, which always has
cleavage at the peptide bond as the central reference point - this makes
the question of "either side" irrelevant.

Schechter and Berger looks like this:

(non-prime side)...Nterm-P3 -P2 -P1 -|-P1P-P2P-P3P-C-term...(prime side)
                ...Nterm-Aaa-Aaa-Aaa-|-Aaa-Aaa-Aaa-C-term...

so the cut is always between P1 and P1' (prime) at what is referred to as the
scissile bond, so if we used something like:

<P1cleave>KR</P1cleave>
<P1Pnoncleave>P</P1Pnoncleave>

for trypsin.

Then this allows us more flexibility and one less element to worry about,
plus it matches the biochemist's expectation.

There may well be good reasons why the search engine developers don't do it
this way, but I thought I'd put this out as a suggestion.

-Simon-

Original comment by i.am.sim...@gmail.com on 19 Jun 2008 at 12:14

GoogleCodeExporter commented 8 years ago

Sorry for the long delay before replying.
Simon, one problem with this approach is that we would need even more tags this way:
 <P1cleave>
 <P1Pcleave>
 <P2cleave>
 <P2Pcleave>
 <P1nocleave>
 <P1Pnocleave>
 <P2nocleave>
 <P2Pnocleave>
etc.
Also, as you say, it isn't the way search engines generally express it, so might be a
little 'foreign' and off-putting?
I'm becoming keener on the regular expression approach because it does (almost) all
we want. I'll put two alternatives in the next two separate comments, so it is easier
to refer to them.

David

Original comment by dcre...@gmail.com on 23 Jul 2008 at 4:16

GoogleCodeExporter commented 8 years ago

The suggestion from Pierre-Alain in comment #2 is to use Perl regular expressions,
which may not be so clear to many people, particularly since these are "Extended
Patterns" in Perl regex...
For Trypsin, for example, it is
(?<=[KR])(?=[^P])

The ?<= is a "zero-width positive look-behind assertion", and the [] means one of
this character set. So, this rule is to look behind for a K or R.

?= is a "zero-width positive look-ahead assertion", and [^P] means any character that
is not P.

http://perldoc.perl.org/perlre.html

An example of a few lines of perl:
 $protein = "ABCDKPEFGHIJKLMNOPQRSTUVWXYZ";
 @peptides = split(/(?<=[KR])(?=[^P])/, $protein);
 print join "\n", @peptides;

gives:
ABCDKPEFGHIJK
LMNOPQR
STUVWXYZ

So, this option would be:

<SpectrumIdentificationProtocol ...
  <Enzyme>
    <Rules><![CDATA[(?<=[KR])(?=[^P])]]></Rules>
    <optional CV>
  </Enzyme>

These expressions are very powerful and can do pretty much anything that we want.

btw, it is obviously optional whether you use 'CDATA' or not - it does make it
easier for humans to read because there's no need to escape the < and &. Everything
inside a CDATA section is treated by the parser as literal text, not as markup.
A CDATA section starts with "<![CDATA[" and ends with "]]>".

Is this all too obscure for most people?

Original comment by dcre...@gmail.com on 23 Jul 2008 at 4:18

GoogleCodeExporter commented 8 years ago

A less 'perly' way to do this would be to use what I suggested in comment #6, except
that we could use (simple!) regular expressions for the cleaveSite and noCleaveSite
elements.

Original comment by dcre...@gmail.com on 23 Jul 2008 at 4:21

GoogleCodeExporter commented 8 years ago

In response to David's comment #8, actually, the notation I propose does not
require all these extra lines as suggested. It just requires the two I originally
gave (for trypsin at least). But my main gripe with the original suggestion to use
<CleaveSite> and <noCleaveSite> was that it doesn't tell you where the cleavage
actually is! The P1-P1' notation does - it's fixed. The same criticism could be
levelled at the regex approach - we would still need to explicitly define where the
cleavage is in some way. That could easily be done of course, so perhaps someone
could think of a way to do it?

I'm generally for the use of regex by the way, but we do need to ensure we
specify where the cleavage is in an intuitive way.
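
For what it's worth, with the zero-width lookaround form of the regex the match position itself is the scissile bond: the match starts and ends between P1 and P1'. A minimal Java sketch of reading the cut positions out (class and method names are hypothetical, and it assumes the rule is zero-width, as in the Perl example in comment 9):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CleavagePositions {

    // Returns the 0-based index of the P1' residue for every cut.
    // Matches at the very start or end of the sequence are skipped,
    // since there is no bond to cut there.
    static List<Integer> cutPositions(String protein, String siteRegex) {
        List<Integer> cuts = new ArrayList<>();
        Matcher m = Pattern.compile(siteRegex).matcher(protein);
        while (m.find()) {
            if (m.start() > 0 && m.start() < protein.length()) {
                cuts.add(m.start());
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        String protein = "ABCDKPEFGHIJKLMNOPQRSTUVWXYZ";
        // prints [13, 20]: the cuts fall between K|L and R|S; K-P is not cut
        System.out.println(cutPositions(protein, "(?<=[KR])(?=[^P])"));
    }
}
```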

-Simon-

Also, if a biochemist has a to generate a new

Original comment by i.am.sim...@gmail.com on 25 Jul 2008 at 2:50

GoogleCodeExporter commented 8 years ago

Simon, sorry, I must be missing something here then. How would you specify Asp-N
(cleaves at DB, Nterm)?
And how would you specify Caspase (Cterm, 3 possible sequence patterns: DEVD or DQTD
or ELPD)?

David

Original comment by dcre...@gmail.com on 29 Jul 2008 at 9:55

GoogleCodeExporter commented 8 years ago

In response to David.

Asp-N would simply be:

<P1Pcleave>DB</P1Pcleave>

Caspases would require a few more lines, and coping with the concept of "or"
means introducing some new notation. Regexs would be more elegant here I
suppose. But it can still be done.

<P4cleave>D</P4cleave>
<P3cleave>E</P3cleave>
<P2cleave>V</P2cleave>
<P1cleave>D</P1cleave>

The above covers the first example. Is there really a single Caspase that
cleaves DEVD or DQTD or ELPD explicitly? I'm not sure there is, and it's
an ongoing debate in the literature to define a lot of caspase specificities,
isn't it?

So why do I keep going on about this? It's largely because this notation
parallels the biochemistry. It might not be high on the agenda for designing
a data model for exchange for some people, but I think it should be, especially
as the pre-existing one is widely used in all the literature and can be
found in all the standard textbooks.

-Simon-

Original comment by i.am.sim...@gmail.com on 29 Jul 2008 at 1:30

GoogleCodeExporter commented 8 years ago

So my list of available element names (in the schema) in #8 is pretty much correct?

> The same criticism could be levelled at the regex approach - 
> we would still need to explicitly define where the cleavage is in some way. 
If you look (very!) carefully at #9, the regex format does describe this. The problem
with this approach is (I think) that it is too obscure. If it's not obvious to you,
it's not going to be obvious to most people.

I still don't have a real preference, but am becoming keener on your suggestion. If
we can model multiple enzymes:
> If more than one enzyme, they can be applied to separate aliquots which
> are then mixed, or they can be applied 'together'. (If separate aliquots,
> then a peptide cannot be cleaved at one terminus by one enzyme, and the
> other by a different enzyme)
How about something like (for Trypsin and Asp-N applied in separate aliquots):

<enzymes independent="1" missedCleavages="2" semiSpecific="0" minDistance="4">
  <enzyme name="Trypsin">
    <P1cleave>KR</P1cleave>
    <P1Pnoncleave>P</P1Pnoncleave>
  </enzyme>
  <enzyme name="Asp-N">
    <P1Pcleave>DB</P1Pcleave>
  </enzyme>
</enzymes>

then this could also be used for the (dubious) Caspase case.
<enzymes independent="0" missedCleavages="1" semiSpecific="0" minDistance="4">
  <enzyme name="Caspase1">
    <P4cleave>D</P4cleave>
    <P3cleave>E</P3cleave>
    <P2cleave>V</P2cleave>
    <P1cleave>D</P1cleave>
  </enzyme>
  <enzyme name="Caspase2">
    <P4cleave>D</P4cleave>
    <P3cleave>Q</P3cleave>
    <P2cleave>T</P2cleave>
    <P1cleave>D</P1cleave>
  </enzyme>
  <enzyme name="Caspase3">
    <P4cleave>E</P4cleave>
    <P3cleave>L</P3cleave>
    <P2cleave>P</P2cleave>
    <P1cleave>D</P1cleave>
  </enzyme>
</enzymes>
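
The P-position elements above map fairly directly onto a lookaround regex, which might help bridge the two proposals. A hedged Java sketch: the class, method and parameter names are invented for illustration, and only the P1' position is handled on the prime side.

```java
import java.util.Arrays;

final class SchechterBergerRule {

    // nonPrime lists the allowed residues from the outermost position down to P1,
    // e.g. {"D", "E", "V", "D"} for P4..P1. The P1' arguments may be empty strings.
    static String toRegex(String[] nonPrime, String p1PrimeCleave, String p1PrimeNonCleave) {
        StringBuilder regex = new StringBuilder();
        if (nonPrime.length > 0) {
            regex.append("(?<=");
            for (String residues : nonPrime) {
                regex.append('[').append(residues).append(']');
            }
            regex.append(')');
        }
        if (!p1PrimeCleave.isEmpty()) {
            regex.append("(?=[").append(p1PrimeCleave).append("])");
        }
        if (!p1PrimeNonCleave.isEmpty()) {
            regex.append("(?![").append(p1PrimeNonCleave).append("])");
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex(new String[] {"KR"}, "", "P")); // prints (?<=[KR])(?![P])
        System.out.println(toRegex(new String[0], "DB", ""));      // prints (?=[DB])
        String caspase1 = toRegex(new String[] {"D", "E", "V", "D"}, "", "");
        System.out.println(caspase1);                               // prints (?<=[D][E][V][D])
        // prints [MKDEVD, GS]
        System.out.println(Arrays.toString("MKDEVDGS".split(caspase1)));
    }
}
```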

David

Original comment by dcre...@gmail.com on 29 Jul 2008 at 2:22

GoogleCodeExporter commented 8 years ago

Examples for suggestion #10

<Enzymes name="LysC+AspN" semiSpecific="0" missedCleavages="1" independent="1"
minDistance="4">
  <enzyme>
    <cleaveSite>K</cleaveSite>
    <noCleaveSite>P</noCleaveSite>
    <terminus>C</terminus>
  </enzyme>
  <enzyme>
    <cleaveSite>[DB]</cleaveSite>
    <noCleaveSite></noCleaveSite>
    <terminus>N</terminus>    
  </enzyme>
  <CTermGain>OH</CTermGain>
  <NTermGain>H</NTermGain>
</Enzymes>

<Enzymes name="Caspase" semiSpecific="0" missedCleavages="1" independent="1"
minDistance="4">
  <enzyme>
    <cleaveSite>DEVD|DQTD|ELPD</cleaveSite>
    <noCleaveSite></noCleaveSite>
    <terminus>C</terminus>
  </enzyme>
  <CTermGain>OH</CTermGain>
  <NTermGain>H</NTermGain>
</Enzymes>

Original comment by dcre...@gmail.com on 29 Jul 2008 at 9:03

GoogleCodeExporter commented 8 years ago

Additional possibility (just as memo from previous discussions): 

Describe the search engine parameter "enzyme" using CV terms, e.g.:
<AdditionalSearchParams>
  <pf:cvParam accession="PSI:0000XYZ" name="Paragon:DefaultEnzyme" cvRef="PSI"/>
  ...
</AdditionalSearchParams>

or

<Enzymes>
  <pf:cvParam accession="PSI:0000XYZ" name="Paragon:DefaultEnzyme" cvRef="PSI"/>
  ...
</Enzymes>

Original comment by eisena...@googlemail.com on 31 Jul 2008 at 9:06

GoogleCodeExporter commented 8 years ago

Original comment by eisena...@googlemail.com on 31 Jul 2008 at 3:02

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

Suggestion: create CV terms for common cases with the regular expression defined
within the term?

Original comment by delag...@gmail.com on 31 Jul 2008 at 3:26

GoogleCodeExporter commented 8 years ago

Angel will generate some examples for OBO terms w/ regex

Original comment by delag...@gmail.com on 31 Jul 2008 at 3:27

GoogleCodeExporter commented 8 years ago

Just out of interest - I tried the example extended regex in Java (from comment 9
above), it works just as easily in Java as it does in Perl:

String[] peptides = "ABCDKPEFGHIJKLMNOPQRSTUVWXYZ".split("(?<=[KR])(?=[^P])");
for (String peptide : peptides){
   System.out.println("peptide = " + peptide);
}

Output:
peptide = ABCDKPEFGHIJK
peptide = LMNOPQR
peptide = STUVWXYZ

Any other language examples?

Original comment by philip.j...@gmail.com on 31 Jul 2008 at 5:12

GoogleCodeExporter commented 8 years ago

For SEQUEST, any option is okay as long as it states:
- Offset
- Sites (e.g. "KR" for Trypsin)
- No-sites (e.g. P for Elastase)

Caspase is not possible.

Original comment by eisena...@googlemail.com on 11 Sep 2008 at 1:59

GoogleCodeExporter commented 8 years ago

To sum up, what was agreed in a TeleCon in August:
1) Have the possibility to state a regular expression
2) Have CV terms for the most important enzymes with a pre-defined regexp

Original comment by eisena...@googlemail.com on 11 Sep 2008 at 3:01

GoogleCodeExporter commented 8 years ago

possible XML:

    <cleavageEnzymes>
    <!-- Trypsin cutting cterm of K and R: -->
        <oneCleavageEnzyme identifier="Trypsin" CTermGain="OH" NTermGain="H">
            <cleavageEnzymeCV accession="PSI-PI:000456" name="Trypsin" cvRef="PSI-PI"/>
        </oneCleavageEnzyme>
    <!-- Cleavage C and Nterm of D, and trypsin cleavage at cterm of K and R -->
        <oneCleavageEnzyme identifier="ChemDigest_and_Trypsin" CTermGain="" NTermGain="">
            <siteRegexp><![CDATA[(?<=[DKR])|(?=[D])]]></siteRegexp>
            <cleavageEnzymeCV accession="PSI-PI:000456" name="ChemDigest_and_Trypsin"
cvRef="PSI-PI"/>
        </oneCleavageEnzyme>
    <!-- Caspase (3 sequence patterns) -->
        <oneCleavageEnzyme identifier="Caspase">
            <siteRegexp><![CDATA[(?<=(?:DEVD|DQTD|ELPD))]]></siteRegexp>
            <cleavageEnzymeCV accession="PSI-PI:000567" name="Caspase" cvRef="PSI-PI"/>
        </oneCleavageEnzyme>
    </cleavageEnzymes>
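
A quick Java check of the two siteRegexp values used above. This is only a sketch with made-up class and constant names and invented test sequences; it says nothing about how a search engine would consume these elements.

```java
import java.util.Arrays;
import java.util.List;

final class SiteRegexpExamples {

    // Regexes copied from the siteRegexp elements in the proposal.
    static final String CHEM_DIGEST_AND_TRYPSIN = "(?<=[DKR])|(?=[D])";
    static final String CASPASE = "(?<=(?:DEVD|DQTD|ELPD))";

    static List<String> digest(String protein, String siteRegexp) {
        return Arrays.asList(protein.split(siteRegexp));
    }

    public static void main(String[] args) {
        // Cutting on both sides of D leaves D as a fragment of its own.
        // prints [AA, D, CCK, EE]
        System.out.println(digest("AADCCKEE", CHEM_DIGEST_AND_TRYPSIN));
        // prints [MKDEVD, GSELPD, AA]
        System.out.println(digest("MKDEVDGSELPDAA", CASPASE));
    }
}
```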

Agree?

Original comment by eisena...@googlemail.com on 11 Sep 2008 at 3:02

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Added a proposal to the schema in the svn. 
More CV terms to be added!

Original comment by eisena...@googlemail.com on 11 Sep 2008 at 6:04

GoogleCodeExporter commented 8 years ago

Martin, looks good but I think we agreed to use the names <Enzymes> and <Enzyme>,
following the format in, for example, #15:

<Enzymes independent="1">
  <Enzyme  missedCleavages="2" semiSpecific="0" minDistance="4" identifier="XXX"
CTermGain="SH" NTermGain="H6">
   . . .
  </Enzyme>
  <Enzyme>
  </Enzyme>
</Enzymes>

The 'independent' attribute needs to be in <Enzymes>, but all the other attributes
are probably better off in the individual <Enzyme> element as you have done.

For multiple enzymes, if 'independent' is 0, then I suspect that missedCleavages,
semiSpecific, and minDistance would generally be the same for each enzyme, but they
wouldn't need to be. If 'independent' is 1, then all the attributes could be different.

Original comment by dcre...@gmail.com on 16 Sep 2008 at 2:36

GoogleCodeExporter commented 8 years ago

For the OBO terms: 

[Term]
id: PI:00232
name: peptide cleavage enzyme
def: "A general term to represent peptide cleavage enzymes. Cleavage rules are
specified using PCRE version 7.4 compliant regular expressions." [ref:ref]
is_a: PI:00000 ! protein informatics cv

[Term]
id: PI:00232
name: Trypsin
def: "Trypsin enzyme. Cleaves C-terminal to Lysine (Lys [K]) and Arginine (Arg [R]), but not when either is followed by Proline (Pro [P])."
is_a: PI:00232 ! peptide cleavage enzyme
regex: "(?<=[KR])(?!P)"

etc, etc. ...
Here are the regexes I came up with (note the notes for some questions I had). 

Name    Cleave_at
Trypsin     (?<=[KR])(?!P)
Arg-C   R(?!P)
Asp-N   (?<=[:alpha:])(?=[BD]) # N-terminus cleavages require prefix AA? 
Asp-N_ambic     (?<=[:alpha:])(?=[DE]) # see above
Chymotrypsin    (?<=[FYWL])(?!P)
CNBr    (?<=M)
Formic_acid     ((?<=D))|((?=D)) # note: is this either/or, or does it excise the Asp (D) completely from the sequence?
Lys-C   (?<=K)(?!P)
Lys-C/P     (?<=K)
PepsinA     (?<=[FL])
Tryp-CNBr   (?<=[KRM])(?!P)
TrypChymo   (?<=[FYWLKR])(?!P)
Trypsin/P   (?<=[KR])
V8-DE   (?<=[BDEZ])(?!P)
V8-E    (?<=[EZ])(?!P)
CNBr+Trypsin    (?<=M)|(?<=[KR])(?!P)
KR  (?<=P)

Original comment by delag...@gmail.com on 2 Oct 2008 at 6:50

GoogleCodeExporter commented 8 years ago

Looks good.
I presume that (?!P) is the same as (?=[^P]), but maybe it is clearer to use the same
syntax for one residue as for multiple residues?

I think I've corrected these properly:
Asp-N (?=[BD])
Asp-N_ambic (?=[DE])

And I'm pretty sure that your Formic_acid is correct. (I tested by using a perl
script for the regex and comparing with examples in the Mascot configuration editor).

The Arg-C doesn't seem to work (removes the 'R'), so should be
(?<=R)(?!P)
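
The Arg-C problem is easy to reproduce in Java as well. A minimal sketch (class name and test sequence are made up): the original pattern matches the R itself, so split() throws it away, whereas the look-behind version is zero-width and keeps it.

```java
import java.util.Arrays;

final class ArgCConsumesResidue {

    static String[] digest(String protein, String siteRegex) {
        return protein.split(siteRegex);
    }

    public static void main(String[] args) {
        String protein = "AARCCRPDD";
        // R(?!P) consumes the R at the cut site: prints [AA, CCRPDD]
        System.out.println(Arrays.toString(digest(protein, "R(?!P)")));
        // (?<=R)(?!P) is zero-width, so the R stays: prints [AAR, CCRPDD]
        System.out.println(Arrays.toString(digest(protein, "(?<=R)(?!P)")));
    }
}
```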

I'm not so sure about having multiple enzymes such as CNBr+Trypsin in the CV. You've
specified two options:
CNBr+Trypsin    (?<=M)|(?<=[KR])(?!P)
Tryp-CNBr   (?<=[KRM])(?!P)
But I'm not convinced that either works 100% properly for the two cases:
 - both applied to the same aliquot.
 - applied to separate aliquots and these are then mixed (i.e. both termini will be
   Tryptic or both CNBr).
Since we have a mechanism for mixed enzymes in the schema, we should probably use
that and remove the mixed ones?

I've added an enzyme section for a mixed enzyme (CNBr+Trypsin) to the
Mascot_MSMS_example.axml file in the examples directory.

At the moment, for a mixed enzyme there is no place for the name chosen from the
search form drop down list for the enzyme. Likewise, if in any search engine someone
chose to call Trypsin, say, "Bovine Trypsin", there's no place for this name as we
should just give the accession for Trypsin?

In the schema, we could restrict CTermGain and NTermGain to what we would expect in
chemical formulae? [A-Z][a-z][0..9][ ] to stop someone entering a decimal number?

I've put the regex plus the CV, don't know if that is what is intended.

Original comment by dcre...@gmail.com on 3 Oct 2008 at 9:02

GoogleCodeExporter commented 8 years ago

>I presume that (?!P) is the same as (?=[^P]), but maybe is clearer to use the same
>syntax for one residue as for multiple residues?

This is a matter of style. I have a preference to use the PCRE specification for
negating look-ahead and look-behind assertions, which are (?!...) and (?<!...)
respectively. Also I tend to steer towards the most succinct regex, since this is
clearer and easier to understand for me. The character class negation seems like you
are putting the negation in the wrong place and has the potential for double
negatives, e.g. (?![^P]).
Also for character classes, I tend toward only having a single character when
this is the case, as in (?!P) instead of (?![P]). The compiled regex parse tree is
different for these, even though the result should be the same.

So I propose two notes on style:
1) use the PCRE-supplied negation syntax for look-ahead and look-behind assertions
2) use the most compact representation possible for a regex.

Original comment by delag...@gmail.com on 3 Oct 2008 at 12:28

GoogleCodeExporter commented 8 years ago

Thanks for the clarification and I'll happily agree to the style notes.
If you agree to not including multiple enzymes in the list, then can you confirm that
we have:

Name    Cleave_at
Trypsin     (?<=[KR])(?!P)
Arg-C   (?<=R)(?!P)
Asp-N   (?=[BD])
Asp-N_ambic     (?=[DE]) 
Chymotrypsin    (?<=[FYWL])(?!P)
CNBr    (?<=M)
Formic_acid     ((?<=D))|((?=D)) 
Lys-C   (?<=K)(?!P)
Lys-C/P     (?<=K)
PepsinA     (?<=[FL])
TrypChymo   (?<=[FYWLKR])(?!P)
Trypsin/P   (?<=[KR])
V8-DE   (?<=[BDEZ])(?!P)
V8-E    (?<=[EZ])(?!P)

The only other one that we are lacking is a way to describe 'No enzyme'. A regex:
None  (?<=[A-Z])
is only meaningful if we allow a very large number of missed cleavages. In the current
schema, the Enzymes element is optional, but if you have it, then the Enzyme
element(s) within it are required. So, no enzyme could be specified by just omitting
the Enzymes section, but I'd rather have something explicitly say that there was no
enzyme specificity. Any ideas?
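
To show why the 'None' regex depends on the missed cleavage setting, here is a small Java sketch (hypothetical class and method names). With (?<=[A-Z]) every residue is its own fragment, so all sub-sequences only appear once missedCleavages reaches the sequence length minus one.

```java
import java.util.ArrayList;
import java.util.List;

final class NoEnzymeDigest {

    static final String NONE = "(?<=[A-Z])";

    // Join up to (missedCleavages + 1) adjacent fragments into each peptide.
    static List<String> digest(String protein, String siteRegex, int missedCleavages) {
        String[] fragments = protein.split(siteRegex);
        List<String> peptides = new ArrayList<>();
        for (int start = 0; start < fragments.length; start++) {
            StringBuilder peptide = new StringBuilder();
            for (int end = start; end < fragments.length && end - start <= missedCleavages; end++) {
                peptide.append(fragments[end]);
                peptides.add(peptide.toString());
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        // No missed cleavages: only single residues. prints [A, C, D]
        System.out.println(digest("ACD", NONE, 0));
        // length - 1 missed cleavages: every sub-sequence. prints [A, AC, ACD, C, CD, D]
        System.out.println(digest("ACD", NONE, 2));
    }
}
```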

Also, any comments on:
At the moment, for a mixed enzyme there is no place for the name chosen from the
search form drop down list for the enzyme. Likewise, if in any search engine someone
chose to call Trypsin, say, "Bovine Trypsin", there's no place for this name as we
should just give the accession for Trypsin?

In the schema, we could restrict CTermGain and NTermGain to what we would expect in
chemical formulae? [A-Z][a-z][0..9][ ] to stop someone entering a decimal number?

Original comment by dcre...@gmail.com on 3 Oct 2008 at 3:44

GoogleCodeExporter commented 8 years ago

The "no enzyme" is a bit of a conundrum , I admit. If we must have it, then that
regex is as good as any. I think that omission of Enzyme is more true to the
semantics of the experimental protocol, but I can see how it could complicate
matters. If we choose to use "No enzyme" then I vote we make Enzyme a mandatory
element and default the value to "No enzyme". Perhaps this is a case where an
attribute of Enzyme can suffice, as opposed to a CV term.... thoughts anyone?

For the multiple enzymes, I thought I put in my last reply that I agreed we do not
combine enzymes in the CV, but leave it up to the schema to define the combinations.
I guess I didn't. My bad.

For the search engine parameter issue, I think this is userParam territory. 

Last, for C/NTermGain I think your suggestion is a good one.

Original comment by delag...@gmail.com on 8 Oct 2008 at 6:13

GoogleCodeExporter commented 8 years ago

as agreed in TeleCon 9th of October:

1) changed cardinality of <cvParam> child of <Enzyme> to: one to many (to allow synonyms)

2) restricted CTermGain and NTermGain to "[A-Za-z0-9 ]" (basic letters of a chemical
formula) (can be refined later)
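
A small Java sketch of what that restriction does and does not catch. The class name is hypothetical, and the "one or more characters" quantifier is an assumption, since only the character class was agreed in the TeleCon.

```java
import java.util.regex.Pattern;

final class TermGainCheck {

    // Character class from the TeleCon; the "+" quantifier is an assumption.
    private static final Pattern FORMULA = Pattern.compile("[A-Za-z0-9 ]+");

    static boolean isValid(String gain) {
        return FORMULA.matcher(gain).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("OH"));  // prints true
        System.out.println(isValid("H"));   // prints true
        System.out.println(isValid("1.5")); // prints false, the decimal point is rejected
        System.out.println(isValid("17"));  // prints true, plain digits still get through
    }
}
```

So the pattern stops decimal numbers but not bare integers, which is presumably part of what "can be refined later" covers.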

Original comment by eisena...@googlemail.com on 9 Oct 2008 at 4:29

GoogleCodeExporter commented 8 years ago

Continuing with the CV, here are legal OBO definitions that for the most part do not
show up in OBO-edit.

[Term]
id: PI:00242
name: peptide cleavage enzyme
def: "A general term to represent peptide cleavage enzymes. Cleavage rules are
specified using PCRE version 7.4 compliant regular expressions" [ref:ref]
is_a: PI:00000 ! protein informatics cv

[Term]
id: PI:00243
name: Trypsin
def: "Trypsin enzyme. Cleaves C-terminal to Lysine (Lys [K]) and Arginine (Arg [R]), but not when either is followed by Proline (Pro [P])" [ref:ref]
is_a: PI:00242 ! peptide cleavage enzyme
property_value: cleavage_rule "(?<=[KR])(?\\\!P)" xsd:string

[Instance]
id: PI:00244
name: C-terminal
comment: C Terminal
instance_of: PI:00047 ! cleavage: sense

[Instance]
id: PI:00245
name: N-terminal
comment: N Terminal
instance_of: PI:00047 ! cleavage: sense

[Typedef]
id: cleavage_rule
name: cleavage_rule
domain: OBO:TERM
range: xsd:string ! xsd:string
definition: "Cleavage rule."

# end OBO file
Specifically, the instances and the property_value of the "cleavage_rule" Typedef. I
am at a loss as to how to continue. Do we restrict our CV to OBO-edit's capabilities?
Or just define the CV using "best-practices"?

On that note, it seems that terms PI:00046 to PI:00050 richly specify enzymes and
make the use of regular expressions moot (e.g. we can choose to go uber-verbose and
not put in terms for the major enzymes, thus avoiding regular expressions altogether and
forcing the definition of an enzyme to use all of the terms from PI:00046-50 when
outputting an experiment).

Original comment by delag...@gmail.com on 15 Oct 2008 at 6:25

GoogleCodeExporter commented 8 years ago

Note: I am sure that the above OBO examples have a few syntax mistakes, since I could
not test them out in OBO-edit.

Original comment by delag...@gmail.com on 15 Oct 2008 at 7:36

GoogleCodeExporter commented 8 years ago

OBO-Edit does not yet (...) support the property_value on terms/classes; I did
ask for it.
Meanwhile, a temporary solution is to use the following syntax (editable and
visible in OBO-Edit and other OBO viewers):
xref: value-type:{string,int,xsd} "regular expression"

Original comment by joecoppo...@gmail.com on 17 Oct 2008 at 3:53

GoogleCodeExporter commented 8 years ago

What does the triple backslash do in: "(?<=[KR])(?\\\!P)"

Also, it's probably a good idea to spell out "Perl-compatible regular expressions
(PCRE)" because I'm a computer programmer and I didn't know what PCRE meant even
though I know how to use Perl regex. :)

Original comment by matthew....@vanderbilt.edu on 23 Oct 2008 at 4:20

GoogleCodeExporter commented 8 years ago

Two possible variants for encoding regular expressions
for the default enzymes into the OBO file:
1) "xref" and 2) "has_a" relationship.

[Term]
id: PI:00251
name: Trypsin
xref_analog: regexp:(?<=[KR\])(?\!P)
is_a: PI:00045 ! cleavage agent name
relationship: has_a PI:00176 ! (?<=[KR])(?!P)

For the 1st variant, the regular expression is only a string.
For the 2nd variant, the regular expression is itself a 
term (PI:00176) and child of a "regular expression" term.

Both methods have disadvantages:
In OBOEdit the xref gives a warning, because it contains non-URI characters.
In OBOEdit the has_a relationship is not shown in the tree view, but only in the Parent Plugin (see screenshot attached).

Which do we prefer?

Original comment by eisena...@googlemail.com on 7 Nov 2008 at 2:03

GoogleCodeExporter commented 8 years ago

TeleCon 12th Nov:

We decided to use the has_a relationship, because it's more formal.

Martin: change has_a to has_regular_expression and delete the xrefs.

Original comment by eisena...@googlemail.com on 12 Nov 2008 at 4:11

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Original comment by dcre...@gmail.com on 7 Dec 2008 at 4:37

Changed state: Fixed

vogelwk / psi-pi

Specifying the enzyme rules used in the search #30