Open GoogleCodeExporter opened 9 years ago
File sketch attached, containing two SpectrumIdentificationLists (SILs) for
Mascot and Sequest, then a third list for the combined results, showing
protocol details etc.
There are some notes in the file showing what problems remain to be solved and
what workarounds are needed.
Original comment by andrewro...@googlemail.com
on 16 May 2013 at 3:00
In Scaffold we take an approach that is quite like Option B when producing
mzIdentML; each SpectraData (what we call MS Samples internally) gets a single
SIL, and PSMs within that SIL contain scores for multiple search engines.
Because our goal is to combine multiple search engines' scores, we want to see
all of the scores in one place, so from our point of view, our "consensus list"
(which might be more accurately called a "union list") makes any single-search
SIL redundant. This also means that we strongly prefer having separate CV terms
for similar scores/concepts from different search engines, as any
agreement/disagreement between search engines is valuable information. We care
not only about the value, but how it was produced (i.e. what search engine it
came from). The key point here is that even though scores from different
engines (e-value, for example) may be similar in some respects, they are the
result of very different algorithms, and treating them as interchangeable
discards this information.
When we consume mzIdentML, we take a more relaxed strategy, but one that still
adheres more closely to Option B. Instead of reading SILs separately, we merge
results on a per-spectrum level, so each SpectraData becomes a single sample
internally. This works for multiple types of input files -- if the file has
multiple SILs representing separate searches we merge them to get "union"
results. If a CV term is introduced to indicate which SILs are consensus
results, this approach becomes confusing: does the user want to analyze the raw
results, or the consensus ones? It's my opinion that these data should be
presented in different files to reduce the risk of confusing what data is final
and what is intermediate.
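A minimal sketch of the per-spectrum merge described above, not Scaffold's actual code. It assumes each PSM has already been parsed into a plain dict; the keys `spectra_data_ref`, `spectrum_id`, `peptide`, `engine` and `scores` are hypothetical names chosen for illustration. Scores stay grouped under the engine that produced them, so agreement or disagreement between engines is preserved.

```python
from collections import defaultdict


def merge_by_spectrum(psm_lists):
    """Merge PSMs from several lists into one "union" record per spectrum.

    Each PSM is a dict with hypothetical keys: spectra_data_ref,
    spectrum_id, peptide, engine, scores (a dict of score name -> value).
    """
    merged = defaultdict(lambda: defaultdict(dict))
    for psm_list in psm_lists:
        for psm in psm_list:
            # A spectrum is identified by its file reference plus its ID.
            key = (psm["spectra_data_ref"], psm["spectrum_id"])
            # Keep scores under the engine name so they are never
            # treated as interchangeable.
            merged[key][psm["peptide"]][psm["engine"]] = dict(psm["scores"])
    # Convert the nested defaultdicts to plain dicts for the caller.
    return {
        key: {peptide: dict(by_engine) for peptide, by_engine in peptides.items()}
        for key, peptides in merged.items()
    }
```

Keying on the pair (file reference, spectrum ID) means the same merge works whether a file holds one list per search engine or a single combined list.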
The mzIdentML specification is flexible enough that many different sorts of
data can be represented, and different contexts (or steps in a workflow) call
for different sorts of data. While I can see the appeal of a single file that
contains all of the data one might ever want to access, such a file would be
difficult to read and interpret in general, so it makes more sense to me to
produce different files depending on their intended use. If your consensus
results should be read like "just another search engine" it would be fine to
follow what's in the specification currently, and there's no need for an
additional CV term. If the consensus results should be interpreted separately
from the individual search engine results, then they should be in their own
file. Simply shifting this responsibility to consumers of mzIdentML is too much
to ask.
Original comment by seth.j...@proteomesoftware.com
on 22 May 2013 at 4:41
Main issues to solve:
- If we allow both encodings in different files (Option A: n lists, one per
search engine; Option B: one consensus list), how do we signal to a data
consumer exactly what they should read? In the case of Option B this is fairly
easy, but in the case of Option A it is difficult, especially if, for example,
this was a PRIDE submission.
- For the consensus file, how should we record the search protocol and the
software names?
- How do we enforce or recommend any rules?
Original comment by andrewro...@googlemail.com
on 23 May 2013 at 2:52
Copied over from the list, here are the comments from Eric and myself:
Hi Andy, this seems quite reasonable. I would be fine with it. I would also
offer as an alternative a pair of terms such as “final PSM list” and
“intermediate list” that take no value, rather than the true/false. This
solution might seem a little more flexible for adding a third category in the
future. The mapping file would require one or the other. I don’t feel
strongly either way, but offer it as a discussion idea.
Regards,
Eric
From: Jones, Andy [mailto:Andrew.Jones@liverpool.ac.uk]
Sent: Thursday, January 23, 2014 8:40 AM
To: 'psidev-pi-dev@lists.sourceforge.net'
(psidev-pi-dev@lists.sourceforge.net)
Subject: [Psidev-pi-dev] mzid 1.2 issues
Hi all,
While updating the specifications to version 1.2
(http://code.google.com/p/psi-pi/source/browse/trunk/specification_document/spec
doc1_2/mzIdentML1.2-draft.doc), I want to make sure we have completely sorted
out the encoding for cases involving multiple SpectrumIdentificationLists. As
far as I can tell, there are only two concrete cases we have identified so far:
- Multiple search engines analysing the same input (Section 5.3.1)
- Combination of multiple fractions into a single unit for protein
inference (Section 5.3.2)
There may also be some other unidentified cases that involve multiple
SpectrumIdentificationLists and at the moment reading software cannot easily
process them. My proposal is as follows (not currently in the spec doc).
Add a mandatory (MUST) CV term to every SpectrumIdentificationList:
“final PSM list”, value = “true|false”. We then also state in the spec
doc that consensus results MUST have this term set to true and that original
search engine results (in a multiple search engine file) MUST NOT. For multiple
fraction files, all lists would have final PSM list=true. Ideally, we would
also add a further check that, within final PSM lists, the spectrum identifier
(combination of spectrumID and spectraData_ref) MUST be unique, i.e. there is
only one final SpectrumIdentificationResult for each spectrum searched. I
don’t know if this can be done easily in the validator.
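As a rough illustration of the uniqueness check proposed above (not the validator's implementation), the sketch below counts each (spectraData_ref, spectrumID) pair across all lists marked final and reports any pair seen more than once. The dict keys `spectra_data_ref` and `spectrum_id` are hypothetical stand-ins for the parsed attributes of each SpectrumIdentificationResult.

```python
from collections import Counter


def duplicated_spectra(final_lists):
    """Return spectrum identifiers that occur more than once.

    `final_lists` holds one list of results per final PSM list; each
    result is a dict with hypothetical keys spectra_data_ref and
    spectrum_id.
    """
    counts = Counter(
        (result["spectra_data_ref"], result["spectrum_id"])
        for results in final_lists
        for result in results
    )
    # An empty return value means every spectrum has one final result.
    return sorted(key for key, n in counts.items() if n > 1)
```

Because the counter spans all final lists together, a multiple-fraction file passes as long as each fraction points at a different SpectraData.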
I think this makes it absolutely clear what reading software is expected to
process. The original specs stated that all results are considered final, but
since we want to support multiple search engine approaches, we are implicitly
supporting a pipeline with intermediate and final results, hence another
solution is needed.
We might also choose to add further CV terms qualifying the role of individual
SpectrumIdentificationLists, such as “consensus list” or
“pre-consensus individual search engine list” etc, but I don’t think
these are as important as making life easy for readers.
Any opinions?
Best wishes
Andy
Original comment by andrewro...@googlemail.com
on 31 Jan 2014 at 3:57
Discussed and decided to follow Eric's suggestion with a SHOULD rule, and to
document that if the terms are absent, reading software may choose to read only
the first list. Validation software should check that each spectrum is unique
across all "final PSM" lists.
Original comment by andrewro...@googlemail.com
on 31 Jan 2014 at 4:26
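One possible reading of the decision above from the point of view of reading software, offered as a sketch only. It assumes each SpectrumIdentificationList has been parsed into a dict whose hypothetical `cv_terms` entry is a set of CV term names; the names “final PSM list” and “intermediate list” are the ones quoted in this thread and may differ from what ends up in the CV.

```python
FINAL_TERM = "final PSM list"  # term name as quoted in this thread


def lists_to_read(sils):
    """Pick which SpectrumIdentificationLists a reader should process.

    `sils` is the file's lists in document order; each is a dict with a
    hypothetical "cv_terms" set of CV term names.
    """
    final = [sil for sil in sils if FINAL_TERM in sil["cv_terms"]]
    if final:
        return final
    # No list is marked final (the terms are only a SHOULD rule), so
    # fall back to reading just the first list.
    return sils[:1]
```

A list marked “intermediate list” is skipped whenever at least one list in the file is marked final.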
Original issue reported on code.google.com by
andrewro...@googlemail.com
on 22 Apr 2013 at 9:33