Support for de novo and other approaches without canonical proteins

GoogleCodeExporter commented 9 years ago

Hi all,

Sean pointed out to me that the current mzid specifications appear to rule out 
the coding of de novo approaches. When we made the transition from mzid 1.0 to 
1.1, we made PeptideEvidenceRef 1..many on SII - meaning that at least one link 
to a database protein is provided for every SII.

For de novo approaches, clearly there is no DBSequence record to reference. 
Other parts of the schema imply (at least partial) support for de novo.

I can't recall the rationale for this change - I think some exporters were 
adding links only to significantly identified proteins, which isn't really the 
purpose of this; we took the decision that we wanted to know where every peptide 
had come from.

Anyway, we need to fix this so that de novo is really supported. Some options:

Option 1. Make PeptideEvidence_ref 0..many. 

Pros: Seems sensible that not all peptide identifications may be linked to a 
protein sequence that has been digested.

Cons: We would have to accept that some exporters from standard search engines 
will produce files with some, many or all PSMs not actually linking to 
DBSequence - which would be a massive pain for any post-processing software. My 
experience has been that everything we make optional in the format tends to be 
used wrongly by developers not reading the specs properly. 

Option 2. Have a workaround with a DBSequence element called "de novo" or some 
such, which all PeptideEvidence elements reference, with start=-1, end=-1, pre=?, 
post=? (as discussed here: http://code.google.com/p/psi-pi/issues/detail?id=79 
in a different context)

Option 3. Add a schema choice: 1..many PeptideEvidence_refs OR 1 
<de_novo_search/> element, thus forcing exporters to put PeptideEvidence_ref on 
every PSM, unless it is a de_novo_search
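
As a rough illustration of what Option 3 would require of each SpectrumIdentificationItem, here is a minimal Python sketch. It assumes the hypothetical <de_novo_search/> element proposed above (it is not part of any released schema), and the helper names local_name and satisfies_option_3 are made up for this example. Tags are compared by local name so the check does not depend on the schema namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop a "{namespace}" prefix if ElementTree added one.
    return tag.rsplit("}", 1)[-1]


def satisfies_option_3(sii):
    """True if the SII has 1..many PeptideEvidenceRef OR exactly one
    de_novo_search child, but not both (the proposed schema choice)."""
    n_refs = sum(1 for c in sii if local_name(c.tag) == "PeptideEvidenceRef")
    # de_novo_search is the hypothetical element from Option 3.
    n_de_novo = sum(1 for c in sii if local_name(c.tag) == "de_novo_search")
    return (n_refs >= 1 and n_de_novo == 0) or (n_refs == 0 and n_de_novo == 1)


db_search = ET.fromstring(
    '<SpectrumIdentificationItem id="SII_1">'
    '<PeptideEvidenceRef peptideEvidence_ref="PE_1"/>'
    "</SpectrumIdentificationItem>"
)
de_novo = ET.fromstring(
    '<SpectrumIdentificationItem id="SII_2"><de_novo_search/>'
    "</SpectrumIdentificationItem>"
)
neither = ET.fromstring('<SpectrumIdentificationItem id="SII_3"/>')

print(satisfies_option_3(db_search))  # True
print(satisfies_option_3(de_novo))    # True
print(satisfies_option_3(neither))    # False
```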

Opinions? I would favour 1 or 3 - 3 seems cleanest to me, but I can see 
arguments in favour of 1 - this would mean fewest changes to software.

best wishes
andy

Original issue reported on code.google.com by andrewro...@googlemail.com on 23 Jan 2014 at 5:38

GoogleCodeExporter commented 9 years ago

Hi all,

I would favour options 2 or 3. Option 2 is a hack and Option 3 implies a schema 
change. If the idea is to support de novo approaches "properly", option 3 would 
be the preferred one.

But I have the feeling that option 1 (which also implies a schema change) will 
make files potentially less consistent, and will make life much harder for 
readers.

Best regards,

Juan

Original comment by javizca74@gmail.com on 24 Jan 2014 at 7:46

GoogleCodeExporter commented 9 years ago

Discussed today - we favour option 1, and will document that file readers may 
choose to ignore PSMs (from database search) if they do not have 
PeptideEvidenceRef.
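
For readers, the "ignore" behaviour could look something like the Python sketch below. This is only an illustration under assumptions: the function name psms_with_evidence and the tiny sample document are invented here, and real files carry a schema namespace, which is why tags are matched on their local name.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop a "{namespace}" prefix if ElementTree added one.
    return tag.rsplit("}", 1)[-1]


def psms_with_evidence(root):
    """Yield only the SpectrumIdentificationItem elements that carry
    at least one PeptideEvidenceRef child; the rest are skipped."""
    for elem in root.iter():
        if local_name(elem.tag) != "SpectrumIdentificationItem":
            continue
        if any(local_name(child.tag) == "PeptideEvidenceRef" for child in elem):
            yield elem


# Invented sample: SII_2 has no link to a protein.
SAMPLE = """
<SpectrumIdentificationResult>
  <SpectrumIdentificationItem id="SII_1">
    <PeptideEvidenceRef peptideEvidence_ref="PE_1"/>
  </SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_2"/>
</SpectrumIdentificationResult>
"""

root = ET.fromstring(SAMPLE.strip())
kept_ids = [sii.get("id") for sii in psms_with_evidence(root)]
print(kept_ids)  # ['SII_1']
```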

Original comment by andrewro...@googlemail.com on 31 Jan 2014 at 4:17

GoogleCodeExporter commented 9 years ago

In the mzid1.2-draft schema I have provisionally made this change, and updated 
DBSequence to be 0..many (from 1..many in mzid1.1). I still have some 
reservations about it though. 

I would be happier if we could add a validation rule that says that if the 
search type is:
        <SearchType>
            <cvParam accession="MS:1001083" cvRef="PSI-MS" name="ms-ms search"/>
        </SearchType>

then PeptideEvidenceRef MUST be present on every SpectrumIdentificationItem. I worry 
that we will get lots of badly exported files, with no links at all from PSMs 
to proteins, due to the spec document not being read properly.

Salvador - would this be possible to implement?

Original comment by andrewro...@googlemail.com on 27 Feb 2014 at 1:50

GoogleCodeExporter commented 9 years ago

I guess that any of the following cvParams should "trigger" the MUST rule on 
SII:

ID: MS:1001584  
Name: combined pmf + ms-ms search

ID: MS:1001083  
Name: ms-ms search

ID: MS:1001081  
Name: pmf search

Do you agree?
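
To make the proposed rule concrete, here is a minimal Python sketch of the check, not the validator's actual implementation. The three trigger accessions are the ones listed above; everything else (the function names, the sample document, the placeholder accession "MS:0000000" standing in for a non-triggering search type) is made up for illustration. It also simplifies by applying the rule to the whole file rather than per search protocol.

```python
import xml.etree.ElementTree as ET

# Search types that would trigger the MUST rule (from the list above).
TRIGGER_ACCESSIONS = {"MS:1001584", "MS:1001083", "MS:1001081"}


def local_name(tag):
    # Drop a "{namespace}" prefix if ElementTree added one.
    return tag.rsplit("}", 1)[-1]


def rule_triggered(root):
    """True if any SearchType holds a cvParam with a trigger accession."""
    for elem in root.iter():
        if local_name(elem.tag) != "SearchType":
            continue
        for child in elem:
            if (local_name(child.tag) == "cvParam"
                    and child.get("accession") in TRIGGER_ACCESSIONS):
                return True
    return False


def missing_evidence(root):
    """Return ids of SIIs that violate the rule (no PeptideEvidenceRef);
    always empty when the search type does not trigger the rule."""
    if not rule_triggered(root):
        return []
    return [
        sii.get("id")
        for sii in root.iter()
        if local_name(sii.tag) == "SpectrumIdentificationItem"
        and not any(local_name(c.tag) == "PeptideEvidenceRef" for c in sii)
    ]


def make_doc(accession):
    # Invented, heavily simplified document; SII_2 has no PeptideEvidenceRef.
    return ET.fromstring(f"""<MzIdentML>
  <SearchType>
    <cvParam accession="{accession}" cvRef="PSI-MS" name="example"/>
  </SearchType>
  <SpectrumIdentificationItem id="SII_1">
    <PeptideEvidenceRef peptideEvidence_ref="PE_1"/>
  </SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_2"/>
</MzIdentML>""")


print(missing_evidence(make_doc("MS:1001083")))  # ['SII_2']
# "MS:0000000" is a placeholder for any search type outside the trigger set.
print(missing_evidence(make_doc("MS:0000000")))  # []
```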

Original comment by smartinez@proteored.org on 27 Feb 2014 at 6:51
