Guidelines for multiple search engine results

GoogleCodeExporter commented 9 years ago

Eric posted this to the PSI-PI list, so copied here for tracking:

By way of background, we at ISB have been testing the ProteoWizard idconvert 
tool to convert our TPP output (pepXML & protXML) into mzIdentML. Salva and 
Alex have been converting their TPP output via a different path, and Harald has 
been converting his PeptideShaker output (which is quite similar to some TPP 
tools) via his own code. Thus far, we’re encoding very similar concepts in 
different ways, none of which really seem right to me. What follows below is 
part recommendation and part questions and plea for decisions. I have split it 
up into several topics.

1) The first issue is how to represent different search engines in one result. 
I’m told that there is an mzIdentML example that encodes it something like 
this:

<SpectrumIdentificationList id="SIL_SEQUEST">
  all <SpectrumIdentificationResult>s
</SpectrumIdentificationList>
<SpectrumIdentificationList id="SIL_Mascot">
  all <SpectrumIdentificationResult>s
</SpectrumIdentificationList>

But there is no example of a consensus result. I also hear that there was some 
suggestion that it would be easier to just have a single 
<SpectrumIdentificationList> and within each <SpectrumIdentificationResult>, 
supply *all* of the search engine scores all at once.

After some discussion, we leaned toward encoding an analysis that combines 
two (or N) different search engines plus a consensus (such as with iProphet or 
PeptideShaker) as:

<SpectrumIdentificationList id="SIL_SEQUEST">
  all <SpectrumIdentificationResult>s
</SpectrumIdentificationList>
<SpectrumIdentificationList id="SIL_Mascot">
  all <SpectrumIdentificationResult>s
</SpectrumIdentificationList>
<SpectrumIdentificationList id="SIL_Consensus">
  all <SpectrumIdentificationResult>s with cvParam "consensus spectrum identification"
</SpectrumIdentificationList>

which I will term “Option A”. In contrast, “Option B” would just have a 
single list with consensus results and scores intermingled with those of the 
different engines.

The advantage of Option A is that it keeps all the different search engine 
results separate, and the consensus can be a single result that in principle 
could be easily used by viewers. The disadvantage, perhaps, is that it makes 
the job of a viewer that wants to show separate search engine results together 
a bit harder, since results from very different parts of the document must be 
merged.

Option A also appears compatible with our pursuit of consolidating common 
search engine scores/concepts (like e-value) rather than repeating endlessly 
SearchEngine:Score terms. Option B would apparently require us to have a 
separate set of terms for each search engine irrespective of their similarity.

It seemed in our little group that Option A is preferred. The only addition 
required would be a new cvParam to denote SpectrumIdentificationResults as 
consensus as shown above.

Thoughts? If we are agreed, we should formalize this so there is Only One Way 
To Do It.

******************************

Specification document states the following:

5.3.1   Multiple database search engines
Proteomics groups now commonly analyze MS data using multiple search engines 
and combine results to improve the number of peptide and protein 
identifications that can be made. The output of such approaches can be 
represented in mzIdentML as follows (see Section 6 for documentation of the 
model elements). Each database search SHOULD be represented by an instance of 
<SpectrumIdentification> (application of the protocol) which references the 
<SpectrumIdentificationProtocol> and the output data: an instance of 
<SpectrumIdentificationList>. As such, if three database search engines are 
used, there SHOULD be three instances each of <SpectrumIdentification>, 
<SpectrumIdentificationProtocol> and <SpectrumIdentificationList>. Results are 
then combined into a list of proteins by a separate process, represented as one 
instance of <ProteinDetection> (application of the protocol), which references 
one instance of <ProteinDetectionProtocol> and references (as input) the three 
instances of <SpectrumIdentificationList>. The output of <ProteinDetection> is 
one instance of <ProteinDetectionList>. If a secondary scoring scheme is used 
to weigh evidence for peptide-spectrum matches according to the search engines 
that have identified them, any consensus or composite scores should be assigned 
to each <SpectrumIdentificationItem> within parallel lists.

It was decided that more complex arrangements of workflows cannot be 
represented in mzIdentML version 1.1, such as different protein lists produced 
by each search engine, then combined by an additional process, since it becomes 
difficult to define which are “final” and which are “intermediate” 
results for data consumers and implementers of databases. Such workflows may be 
incorporated into later versions of the format. 

***************************

Comment from Andy Jones:

Agree this is important to solve. I think the approach in the specification 
document is not the most sensible way to do this, as it does not adequately 
describe where the consensus results should go.

For data consumers, it is most important to recognise which are the final 
results to load into a database i.e. the consensus results. The original search 
engine results could be argued to be intermediates.

If we agree to Eric's option A, my preference is to have a general mechanism 
for flagging that one (or more in rare cases) SpectrumIdentificationList is 
"final". (Note in mzQuantML, this is an attribute on PeptideConsensusList.) 
This is also needed to support use cases where different parameter sets are 
used to analyse the same set of spectra e.g. wide and narrow tolerances.

The overarching problem is that we might want to do one thing internally in a 
pipeline (Option A certainly), but say loading results into PRIDE (or 
visualisation) would be much easier with Option B. What we have described in 
the specification document is bad for both cases!

On balance my proposal is as follows:

- Keep allowing multiple lists, but add to the specification document that 
where multiple lists are provided (containing the same spectra in different 
lists), exactly one list should be flagged with a new CV term "final 
SpectrumIdentificationList". In most cases, this list will have been the input 
to ProteinDetectionProtocol. 

- Cases such as ETD + CID on the same precursors are covered here as separate 
lists (both flagged as final) - since these are different MS2 spectra.

- Note: this might also impact on the pre-fractionation discussions

Please add comments on the Issues list rather than the mailing list. Thanks!

Original issue reported on code.google.com by andrewro...@googlemail.com on 22 Apr 2013 at 9:33

GoogleCodeExporter commented 9 years ago

Sketch file attached, containing two SILs for Mascot and Sequest, then a third 
list for combined results, showing protocol details etc.

There are some notes in the file showing which problems remain to be solved, 
what workarounds are needed, etc.

Original comment by andrewro...@googlemail.com on 16 May 2013 at 3:00


Attachments:

MPC_example_Multiple_search_engines_sketch_with_consensus.mzid

GoogleCodeExporter commented 9 years ago

In Scaffold we take an approach that is quite like Option B when producing 
mzIdentML; each SpectraData (what we call MS Samples internally) gets a single 
SIL, and PSMs within that SIL contain scores for multiple search engines. 
Because our goal is to combine multiple search engines' scores, we want to see 
all of the scores in one place, so from our point of view, our "consensus list" 
(which might be more accurately called a "union list") makes any single-search 
SIL redundant. This also means that we strongly prefer having separate CV terms 
for similar scores/concepts from different search engines, as any 
agreement/disagreement between search engines is valuable information. We care 
not only about the value, but how it was produced (i.e. what search engine it 
came from). The key point here is that even though scores from different 
engines (e-value, for example) may be similar in some respects, they are the 
result of very different algorithms, and treating them as interchangeable 
discards this information.

When we consume mzIdentML, we take a more relaxed strategy, but one that still 
adheres more closely to Option B. Instead of reading SILs separately, we merge 
results on a per-spectrum level, so each SpectraData becomes a single sample 
internally. This works for multiple types of input files -- if the file has 
multiple SILs representing separate searches we merge them to get "union" 
results. If a CV term is introduced to indicate which SILs are consensus 
results this approach becomes confusing: does the user want to analyze the raw 
results, or the consensus ones? It's my opinion that these data should be 
presented in different files to reduce the risk of confusing what data is final 
and what is intermediate.
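
The per-spectrum merge described above can be sketched roughly as follows. This is only an illustration and not Scaffold's actual code: the sample document is heavily simplified (namespace and most required attributes omitted), and `union_by_spectrum` and `local` are hypothetical helper names. Results are keyed on the combination of spectraData_ref and spectrumID.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified stand-in for an mzIdentML file with one list per search engine.
SAMPLE = """
<AnalysisData>
  <SpectrumIdentificationList id="SIL_SEQUEST">
    <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_1" peptide_ref="PEP_1"/>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_Mascot">
    <SpectrumIdentificationResult id="SIR_2" spectrumID="index=0" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_2" peptide_ref="PEP_1"/>
    </SpectrumIdentificationResult>
    <SpectrumIdentificationResult id="SIR_3" spectrumID="index=1" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_3" peptide_ref="PEP_2"/>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList>
</AnalysisData>
"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def union_by_spectrum(root):
    """Group every PSM by (spectraData_ref, spectrumID) across all lists."""
    merged = defaultdict(list)
    for sil in root.iter():
        if local(sil.tag) != "SpectrumIdentificationList":
            continue
        for sir in sil:
            if local(sir.tag) != "SpectrumIdentificationResult":
                continue
            key = (sir.get("spectraData_ref"), sir.get("spectrumID"))
            for sii in sir:
                if local(sii.tag) == "SpectrumIdentificationItem":
                    # Remember which list (search engine) each PSM came from.
                    merged[key].append((sil.get("id"), sii.get("id")))
    return dict(merged)


merged = union_by_spectrum(ET.fromstring(SAMPLE))
for key, psms in merged.items():
    print(key, psms)
```

Keeping the source list id next to each PSM preserves the "which engine produced this" information that the union approach relies on.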

The mzIdentML specification is flexible enough that many different sorts of 
data can be represented, and different contexts (or steps in a workflow) call 
for different sorts of data. While I can see the appeal of a single file that 
contains all of the data one might ever want to access, such a file would be 
difficult to read and interpret in general, so it makes more sense to me to 
produce different files depending on their intended use. If your consensus 
results should be read like "just another search engine" it would be fine to 
follow what's in the specification currently, and there's no need for an 
additional CV term. If the consensus results should be interpreted separately 
from the individual search engine results, then they should be in their own 
file. Simply shifting this responsibility to consumers of mzIdentML is too much 
to ask.

Original comment by seth.j...@proteomesoftware.com on 22 May 2013 at 4:41


GoogleCodeExporter commented 9 years ago

Main issues to solve:

- If we allow both encodings in different files, i.e. Option A) n lists (one 
per search engine) and Option B) one consensus list, how do we signal to a data 
consumer exactly what they should read? In the case of Option B this is fairly 
easy, but in the case of Option A it is difficult, especially if, for example, 
this were a PRIDE submission.

- For the consensus file, how should we record the search protocol and the 
software names?

- How to enforce or recommend any rules?

Original comment by andrewro...@googlemail.com on 23 May 2013 at 2:52


GoogleCodeExporter commented 9 years ago

Copied over from the list, here is comment from Eric and myself:

Hi Andy, this seems quite reasonable. I would be fine with it. I would also 
offer as an alternative a pair of terms such as “final PSM list” and 
“intermediate list” that take no value, rather than the true/false. This 
solution might seem a little more flexible for adding a third category in the 
future. The mapping file would require one or the other. I don’t feel 
strongly either way, but offer it as a discussion idea.

Regards,
Eric

From: Jones, Andy [mailto:Andrew.Jones@liverpool.ac.uk] 
Sent: Thursday, January 23, 2014 8:40 AM
To: ''psidev-pi-dev@lists.sourceforge.net' 
(psidev-pi-dev@lists.sourceforge.net)'
Subject: [Psidev-pi-dev] mzid 1.2 issues

Hi all,

While updating the specifications to version 1.2 
(http://code.google.com/p/psi-pi/source/browse/trunk/specification_document/specdoc1_2/mzIdentML1.2-draft.doc), 
I want to make sure we have completely sorted 
out the encoding for cases involving multiple SpectrumIdentificationLists. As 
far as I can tell, there are only two concrete cases we have identified so far:

-        Multiple search engines analysing the same input (Section 5.3.1)
-        Combination of multiple fractions into a single unit for protein 
inference (Section 5.3.2)

There may also be some other unidentified cases that involve multiple 
SpectrumIdentificationLists and at the moment reading software cannot easily 
process them. My proposal is as follows (not currently in the spec doc).

Add a mandatory (MUST) CV term to every SpectrumIdentificationList – 
“final PSM list” value = “true|false”. We then also state in the spec 
doc that consensus results MUST have this term and that original search engine 
results (in a multiple search engine file) MUST NOT. For multiple fraction 
files, all lists would have final PSM list=true. Ideally, we would also add a 
further check that, within final PSM lists, the spectrum identifier 
(combination of spectrumID and spectraData_ref) MUST be unique, i.e. there is 
only one final SpectrumIdentificationResult for each spectrum searched – I 
don’t know if this can be done easily in the validator. 
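
As a rough indication of what such a uniqueness check might involve, here is a minimal sketch. It is not the validator's implementation. It assumes the term is attached as a cvParam directly under each SpectrumIdentificationList and is matched by name only (no accession had been assigned at this point); the sample document is heavily simplified, and `is_final` and `duplicated_spectra` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FINAL_TERM = "final PSM list"  # term name as proposed above; accession unknown

# Simplified stand-in document: one non-final list and two final lists.
SAMPLE = """
<AnalysisData>
  <SpectrumIdentificationList id="SIL_Mascot">
    <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1"/>
    <cvParam name="final PSM list" value="false"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_Consensus_A">
    <SpectrumIdentificationResult id="SIR_2" spectrumID="index=0" spectraData_ref="SD_1"/>
    <SpectrumIdentificationResult id="SIR_3" spectrumID="index=1" spectraData_ref="SD_1"/>
    <cvParam name="final PSM list" value="true"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_Consensus_B">
    <SpectrumIdentificationResult id="SIR_4" spectrumID="index=1" spectraData_ref="SD_1"/>
    <cvParam name="final PSM list" value="true"/>
  </SpectrumIdentificationList>
</AnalysisData>
"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def children(parent, name):
    """Direct children of `parent` whose local tag name equals `name`."""
    return [c for c in parent if local(c.tag) == name]


def is_final(sil):
    """True if the list carries the final-list term with value 'true'."""
    return any(
        p.get("name") == FINAL_TERM and p.get("value") == "true"
        for p in children(sil, "cvParam")
    )


def duplicated_spectra(root):
    """Spectrum identifiers that occur more than once across final lists."""
    counts = Counter()
    for sil in root.iter():
        if local(sil.tag) == "SpectrumIdentificationList" and is_final(sil):
            for sir in children(sil, "SpectrumIdentificationResult"):
                counts[(sir.get("spectraData_ref"), sir.get("spectrumID"))] += 1
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = duplicated_spectra(ET.fromstring(SAMPLE))
print(duplicates)
```

Because the key includes spectraData_ref, multiple-fraction files in which every list is final would pass this check as long as each fraction refers to its own SpectraData.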

I think this makes it absolutely clear what reading software is expected to 
process. The original specs stated that all results are considered final, but 
since we want to support multiple search engine approaches, we are implicitly 
supporting a pipeline with intermediate and final results, hence another 
solution is needed.

We might also choose to add further CV terms qualifying the role of individual 
SpectrumIdentificationLists – such as “consensus list” or 
“pre-consensus individual search engine list” etc, but I don’t think 
these are so important as making life easy for readers.

Any opinions?
Best wishes
Andy

Original comment by andrewro...@googlemail.com on 31 Jan 2014 at 3:57


GoogleCodeExporter commented 9 years ago

Discussed and decided to follow Eric's suggestion with a SHOULD rule, and to 
document that if the terms are absent, reading software may choose to read only 
the first list. Validation software should check that each spectrum is unique 
across all "final PSM" lists.
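
A reader following this rule might look something like the sketch below. This is illustrative only: it assumes the valueless terms from Eric's suggestion ("final PSM list" and "intermediate list") appear as cvParam children of each list and are matched by name, and it treats "no list flagged as final" as the absent case. `has_term` and `lists_to_read` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

FINAL_TERM = "final PSM list"  # valueless term name from the thread

# Simplified document in which the writer followed the SHOULD rule.
FLAGGED = """
<AnalysisData>
  <SpectrumIdentificationList id="SIL_SEQUEST">
    <cvParam name="intermediate list"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_Mascot">
    <cvParam name="intermediate list"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_Consensus">
    <cvParam name="final PSM list"/>
  </SpectrumIdentificationList>
</AnalysisData>
"""

# Simplified document in which no list carries either term.
UNFLAGGED = """
<AnalysisData>
  <SpectrumIdentificationList id="SIL_SEQUEST"/>
  <SpectrumIdentificationList id="SIL_Mascot"/>
</AnalysisData>
"""


def local(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def has_term(sil, name):
    """True if the list has a direct cvParam child with the given name."""
    return any(local(c.tag) == "cvParam" and c.get("name") == name for c in sil)


def lists_to_read(root):
    """Return the final lists, or only the first list if none is flagged."""
    sils = [e for e in root.iter() if local(e.tag) == "SpectrumIdentificationList"]
    final = [s for s in sils if has_term(s, FINAL_TERM)]
    return final if final else sils[:1]


flagged_ids = [s.get("id") for s in lists_to_read(ET.fromstring(FLAGGED))]
unflagged_ids = [s.get("id") for s in lists_to_read(ET.fromstring(UNFLAGGED))]
print(flagged_ids, unflagged_ids)
```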

Original comment by andrewro...@googlemail.com on 31 Jan 2014 at 4:26


mwalzer / psi-pi

Guidelines for multiple search engine results #77