Support for multiple runs in mzid

GoogleCodeExporter commented 8 years ago

Copy of thread from the list:

Hi Matt,

(hide our dirty laundry) "As much as possible? Just get rid of the spectra, the 
XML, and the ambiguity groups! :) MzIdentML is clearly not intended for 
end-users."

:) I get your point, but in the same way, a non-proteomics person doesn't have 
to understand the internal structure of an mzML, wiff, or raw file to know the 
correspondence with runs on the instrument. They just need to understand how it 
maps to their experimental design and what they or someone else did for them in 
the lab. That's what I'm looking for.

" A tool has to be able to combine mzIdentML files to get a more complete 
picture, whether it's from aggregating fractions or aggregating samples. There 
is no escaping aggregation."

There's a huge difference between expecting a downstream tool to combine 
multiple protein lists and expecting it to combine partial results from 
multiple fractions - the latter requires being able to do protein inference 
analysis, while the former assumes the software producing the result has 
already done this and only comparison/alignment is required. I think protein 
inference is a pretty hard problem relative to alignment and comparison of 
protein lists, so my feeling is that proteomics experimentation will improve in 
quality much faster if we agree to isolate inference to the tools that should 
be doing it when they produce mzIdentML results, and let tools that do 
comparison focus on doing that across mzIdentML files. Would you want a 
repository like PRIDE to have to be capable of doing high quality protein 
inference? Obviously, I'm looking at this from a different perspective than 
you, but I think it is really good to try to reconcile the concerns at our 
different levels of focus.

Best -

Sean

-----Original Message-----
From: Matt Chambers [mailto:matt.chambers42@gmail.com]
Sent: Friday, January 04, 2013 9:30 AM
To: psidev-pi-dev@lists.sourceforge.net
Subject: Re: [Psidev-pi-dev] Does mzIdentML support multiple runs? Or just 
multiple searches of one run?

On 1/4/2013 10:22 AM, Seymour, Sean L wrote:
> Hi all - Happy New Year!
>
> Matt has raised an important issue here. It seems like we're going to 
> quickly have a serious dialect problem in how the format is used if there 
> aren't clear guidelines in this area.
>
> My opinion is that the guidance should be that one mzIdentML file 
> should correspond to one complete protein ID result of one 
> pre-analytical sample (i.e. biological sample of interest). It shouldn't 
> matter for the mzIdentML granularity question how many MS runs or 
> fractions or gel spots this sample is broken into for analytical 
> purposes to get from that starting point sample of interest to the 
> list of the peptides and proteins in it. I don't think 'run' is a 
> useful concept for this question. To serve the bigger scientific 
> purpose of proteomics, the granularity should align with scientific 
> experimentation as would be understood by someone who knows nothing about MS 
> or proteomics other than that they expect to get protein information about 
> samples of interest to them. We should hide our analytical dirty laundry as 
> much as possible.

As much as possible? Just get rid of the spectra, the XML, and the ambiguity 
groups! :) MzIdentML is clearly not intended for end-users. I think it's fine 
to have it locked at 1 run per file (but see below...I wrote this email 
backwards).

> The most important thing is that we agree on how many protein lists 
> should be in an mzIdentML file so that downstream tools know whether 
> they can see an mzIdentML file as a single result or expect they may 
> have to pull it apart in some way. I think we should limit to one 
> protein list per mzIdentML file, as Andy points out is the case now, and not 
> enable/allow workarounds via additional CV.

Yes, I don't see any problem with the current single protein list (the depth of 
the list is arbitrary based on the level of aggregation).

> One complete protein list per mzIdentML file should also constrain 
> things from straying in the other direction as well - i.e. analytical 
> fractions should not be put in separate mzIdentML files.
> This would be an incomplete protein ID picture of the parent sample in 
> each mzIdentML file and would require a downstream tool to combine 
> multiple mzIdentML files into one final protein ID picture - something 
> I strongly believe we should not allow or expect downstream tools to be able 
> to do.

The currently most common practice for search engines is one 
DAT/pepXML/mzIdentML/SQT per run, where a run can represent a fraction of a 
sample, a single sample, or a multiplexed sample. The exceptions I'm familiar 
with are Sequest (which can be one OUT per DTA if you really want to count that 
as an exception), Protein Pilot (one .group per set of runs the user supplies, 
right? but it's totally arbitrary how many runs are included...), and Protein 
Prospector (similar to Protein Pilot IIRC).

I hoped we would keep a 1:1 relationship between mzML/mzXML/MGF/MS2 and 
mzIdentML for simplicity, i.e. 1 run to 1 mzIdentML (but multiple protocols 
allowed), but I don't think it's worth a schema change to force it and Scaffold 
is already exporting multiple runs per mzIdentML. A tool has to be able to 
combine mzIdentML files to get a more complete picture, whether it's from 
aggregating fractions or aggregating samples. There is no escaping aggregation. 
Opening a gigantic unindexed mzIdentML file is pretty absurd.

Thanks for speaking up Sean!
-Matt

>
> My take anyway...
>
> Sean
>
> -----Original Message-----
> From: Jones, Andy [mailto:Andrew.Jones@liverpool.ac.uk]
> Sent: Friday, January 04, 2013 4:59 AM
> To: 'Parag Mallick'
> Cc: 'psidev-pi-dev@lists.sourceforge.net'
> Subject: Re: [Psidev-pi-dev] Does mzIdentML support multiple runs? Or just 
> multiple searches of one run?
>
> Yes I don't see any cardinalities that prevent this, with the 
> exception of only allowing a single protein list. It is not possible 
> to have one protein list per fraction, although if this is really 
> required it could probably be achieved using additional CV terms.
>
> Best wishes
> Andy
>
> -----Original Message-----
> From: Parag Mallick [mailto:paragm@stanford.edu]
> Sent: 03 January 2013 15:44
> To: Jones, Andy
> Cc: psidev-pi-dev@lists.sourceforge.net
> Subject: Re: [Psidev-pi-dev] Does mzIdentML support multiple runs? Or just 
> multiple searches of one run?
>
> I had thought you could do multiple runs such as a fractionated sample. Yes?
>
>
> ~ Parag M ~
>
> typos and brevity by iPhone.
>
> On Jan 3, 2013, at 4:33 AM, "Jones, Andy" <Andrew.Jones@liverpool.ac.uk> wrote:
>
>> Hi Matt,
>>
>> As I recall, we intended a single file to encapsulate one run, 
>> although I don't know if we formally specified this with a MUST rule 
>> anyway. There is a restriction on only having one protein list, which would 
>> generally imply that the file corresponds with one run.
>>
>> Cheers Andy
>>
>> -----Original Message-----
>> From: Matthew Chambers [mailto:matt.chambers42@gmail.com]
>> Sent: 31 December 2012 16:34
>> To: psidev-pi-dev@lists.sourceforge.net
>> Subject: [Psidev-pi-dev] Does mzIdentML support multiple runs? Or just 
>> multiple searches of one run?
>>
>> This isn't spelled out clearly in the specification. Clearly it can 
>> support multiple searches of an MS run, but it isn't specified 
>> whether it can do so for one search against multiple runs, or multiple 
>> searches against multiple runs.
>>
>> Thanks, -Matt

------------------------------------------------------------------------------
Master HTML5, CSS3, ASP.NET, MVC, AJAX, Knockout.js, Web API and much more. Get 
web development skills now with LearnDevNow -
350+ hours of step-by-step video tutorials by Microsoft MVPs and experts.
SALE $99.99 this month only -- learn more at:
http://p.sf.net/sfu/learnmore_122812
_______________________________________________
Psidev-pi-dev mailing list
Psidev-pi-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/psidev-pi-dev

Original issue reported on code.google.com by andrewro...@googlemail.com on 23 Jan 2013 at 3:32

GoogleCodeExporter commented 8 years ago

One extra note - I would like to discuss this briefly at the PSI 2013 meeting.

We need to decide:

- Whether this is allowed at all (I vote yes, so long as protein inference does 
not need to be repeated)
- Whether combined results should be placed in one SIList or in multiple 
SILists (not sure I have a preference on this one at the moment - we need to 
check the specification carefully for what an SIList is supposed to represent)

Original comment by andrewro...@googlemail.com on 5 Mar 2013 at 2:46

GoogleCodeExporter commented 8 years ago

Discussed at PSI2013

Summary:

Pre-fractionation results can be combined in mzIdentML. Different use cases, 
e.g. 10 fractions producing 10 MGF files:

- The search engine combines the MGF files, then runs the search and protein 
inference on them together --> encode as one mzid file, with one SIList and one 
ProteinList.

- The search engine does 10 searches and produces 10 mzid files. 
Post-processing software takes these, combines them, and then does protein 
inference. There is a strong preference for the final output file to combine 
the results into one single SIList and one protein list. The alternative 
encoding, producing 10 SILists, is a pain for reading/viewing software.

ACTION: Update the specification document with this implementation 
recommendation; obviously, it cannot be enforced.
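
As a rough illustration of the recommendation above, reading software could count the lists in a file before deciding how to treat it. This sketch is not from the discussion: it assumes that "SIList" refers to the `SpectrumIdentificationList` element and the protein list to `ProteinDetectionList`, the sample document is a hand-made and incomplete fragment, and `count_lists` and `follows_recommendation` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_lists(xml_text):
    """Count input files, PSM lists and protein lists in an mzIdentML string."""
    counts = {
        "SpectraData": 0,
        "SpectrumIdentificationList": 0,
        "ProteinDetectionList": 0,
    }
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


def follows_recommendation(counts):
    """True if results are collapsed into one PSM list and at most one protein list."""
    return (
        counts["SpectrumIdentificationList"] == 1
        and counts["ProteinDetectionList"] <= 1
    )


# Hand-made, incomplete fragment: two fractions combined into one result.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="fraction_01.mgf"/>
      <SpectraData id="SD_2" location="fraction_02.mgf"/>
    </Inputs>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1"/>
      <ProteinDetectionList id="PDL_1"/>
    </AnalysisData>
  </DataCollection>
</MzIdentML>"""

if __name__ == "__main__":
    counts = count_lists(SAMPLE)
    print(counts, follows_recommendation(counts))
```

A file with several `SpectraData` inputs but a single PSM list would match the combined-fractions encoding described above; a file with ten PSM lists would be the alternative encoding that reading software finds awkward.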

Further note on multiple search engine results. Preference is also to report 
final results in one single list, showing combined scores. Reading software 
does not easily deal with multiple lists - would have to do own combination and 
re-ranking of results.

General principle - each spectrum SHOULD only be reported once per file.

ACTION: Update to spec doc with this recommendation. 
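
A minimal sketch of how a consumer might test the "once per file" principle follows. It is an assumption, not something agreed in the discussion, that a spectrum can be identified by the pair of `spectraData_ref` and `spectrumID` attributes on each `SpectrumIdentificationResult`; the function name and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def spectra_reported_more_than_once(xml_text):
    """Return (spectraData_ref, spectrumID) pairs that occur in more than one result."""
    seen = Counter()
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) == "SpectrumIdentificationResult":
            # Assumed spectrum key: the input file reference plus the spectrum ID.
            key = (
                element.get("spectraData_ref", ""),
                element.get("spectrumID", ""),
            )
            seen[key] += 1
    return sorted(key for key, n in seen.items() if n > 1)


# Hand-made fragment: the first spectrum of SD_1 appears in two lists.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1" spectraData_ref="SD_1" spectrumID="index=0"/>
    <SpectrumIdentificationResult id="SIR_2" spectraData_ref="SD_2" spectrumID="index=0"/>
  </SpectrumIdentificationList>
  <SpectrumIdentificationList id="SIL_2">
    <SpectrumIdentificationResult id="SIR_3" spectraData_ref="SD_1" spectrumID="index=0"/>
  </SpectrumIdentificationList>
</MzIdentML>"""

if __name__ == "__main__":
    print(spectra_reported_more_than_once(SAMPLE))
```

Note that the same `spectrumID` in two different input files (SD_1 and SD_2 here) is not counted as a repeat, since fractions will commonly reuse spectrum indices.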

Also discussed CID/ETD - these SHOULD be reported in separate SILists, since 
protocol and everything else about search is different.

ACTION: Update to spec doc with this recommendation.

Original comment by andrewro...@googlemail.com on 17 Apr 2013 at 1:40

GoogleCodeExporter commented 8 years ago

File attached showing the recommended encoding for fractions (hand-crafted and 
incomplete)

Original comment by andrewro...@googlemail.com on 16 May 2013 at 2:39

GoogleCodeExporter commented 8 years ago

2nd attempt to attach the file

Original comment by andrewro...@googlemail.com on 16 May 2013 at 2:40

Attachments:

55merge_omssa_fractions.mzid

GoogleCodeExporter commented 8 years ago

Here are the minutes from last week's call repeated here:

****
- How to represent fractions in mzIdentML?
- Would it be possible to represent the multiple runs and the collapsed view in 
the same file? In principle, the preferred way would be to allow only the 
combined results (from the different fractions/runs) but not the individual 
results in the same file.

Eric commented that if only the collapsed view was possible it would not be 
possible to convert in both directions between mzIdentML and pep.xml files.

Possible solutions to allow this:

- Use CV param to say if one list of PSMs is the final result or not 
(true/false)
- Use CV params to name specific fractions.

No solution yet: it needs to be discussed further.

*****

Further to this, I think we should make it easy to convert back and forth with 
pepXML, where separate fractions are maintained in separate lists (in the same 
file I think?). As such, we need to have a general mechanism for telling a data 
consumer what to do when they see multiple lists. The assumption is that these 
are all "final" results (since that is a key principle for mzIdentML).

For multiple fractions this is okay and supports either separate lists or 
single lists (depending on how the search was done). The difficulty comes in 
relation to the multiple search engine discussion - should reading software be 
expected to determine how to re-rank results if it sees different SIRs (in 
different lists) referencing the same spectrum?

Original comment by andrewro...@googlemail.com on 23 May 2013 at 2:59

vogelwk / psi-pi

Support for multiple runs in mzid #75