mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Issues with the CV #42

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
(Collect change requests for the CV here.)

[Term]
id: PI:00216
name: sequest:PeptideRank
def: "The SEQUEST result 'Rank' in out file (peptide)." [ref:ref]
is_a: PI:00153 ! search engine specific score

I think every search engine has the concept of rank: Make this generic?

Original issue reported on code.google.com by dcre...@gmail.com on 8 Aug 2008 at 9:10

GoogleCodeExporter commented 9 years ago

Original comment by dcre...@gmail.com on 8 Aug 2008 at 9:10

GoogleCodeExporter commented 9 years ago
Need separate CV terms for precursor and fragment (MS/MS) mass types (monoisotopic or average). Currently we just have:

[Term]
id: PI:00210
name: mass type settings
def: "The type of mass difference value to be considered by the search engine (monoisotopic or average)." [ref:ref]
is_a: PI:00184 ! search run details

[Term]
id: PI:00211
name: mass type setting monoisotopic
is_a: PI:00210 ! mass type settings

[Term]
id: PI:00212
name: mass type setting average isotopic
is_a: PI:00210 ! mass type settings

Original comment by dcre...@gmail.com on 8 Aug 2008 at 12:36

GoogleCodeExporter commented 9 years ago
Need CV for the following types of spectra files (in addition to dta and mgf):

Micromass (.PKL) 
PerSeptive (.PKS) 
Sciex API III 
Bruker (.XML) 
mzData (.XML) 
mzML

Original comment by dcre...@gmail.com on 9 Oct 2008 at 2:36

GoogleCodeExporter commented 9 years ago
Add software vendor term: (Should this be generic as below, or one per vendor?)

    <AnalysisSoftware id="AS_mascot_server" name="Mascot Server" version="2.2.03">
      <pf:ContactRole Contact_ref="ORG_MSL">
        <pf:role>
          <pf:cvParam accession="TODO" name="software vendor" cvRef="TODO"
value="Matrix Science" />
        </pf:role>
      </pf:ContactRole>

Original comment by dcre...@gmail.com on 16 Oct 2008 at 4:48

GoogleCodeExporter commented 9 years ago
Need CV terms for:
- specifying the decoy pattern and 
- the instrument type (or do I have to specify the ion series considered?)

Need CV term for the "Dalton" unit, but I think I can find it in another CV.

Need CV term for Trypsin (and its regexp); Angel proposed it, but where is it in the obo?

Original comment by eisena...@googlemail.com on 23 Oct 2008 at 2:57

GoogleCodeExporter commented 9 years ago
changes to the CV:
- I restructured it into major branches
- considered all comments above (0-5)
- deleted terms now in schema (own branch MOVED_to_schema, later to be deleted)
- added units, some mascot input parameters, roles, instrument types, ion series
(input parameters)

Original comment by eisena...@googlemail.com on 24 Oct 2008 at 3:01

GoogleCodeExporter commented 9 years ago
further changes to the CV:
- added regular expressions for default enzymes as Dbxrefs (will be replaced by 'has_regexp' relations later)
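
For illustration, here is a minimal Python sketch of how a tool might apply such an enzyme regular expression. The pattern is the usual trypsin rule (cleave after K or R, but not before P) and is an assumption, as is the `digest` helper; neither is taken from psi-pi.obo.

```python
import re

# Assumed trypsin rule: cleave C-terminal to K or R, but not before P.
# The pattern is zero-width, so no residues are consumed by the split.
TRYPSIN = re.compile(r"(?<=[KR])(?!P)")


def digest(sequence: str) -> list[str]:
    """Split a protein sequence at every cleavage site (no missed cleavages)."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [peptide for peptide in TRYPSIN.split(sequence) if peptide]


if __name__ == "__main__":
    print(digest("MKAPRPLKR"))
```

Storing the expression as data in the CV would let a reader apply enzymes it has never heard of, which is presumably the point of attaching the regexp to the term.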

Original comment by eisena...@googlemail.com on 24 Oct 2008 at 3:18

GoogleCodeExporter commented 9 years ago
How to proceed with modifications? Where can we get an OBO file for the accessions and relations?

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 9:06

GoogleCodeExporter commented 9 years ago
Add 'empty' CV terms for locations in the schema where CV terms are allowed but no terms exist in the CV yet.

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 9:08

GoogleCodeExporter commented 9 years ago
Why do we define a new term for unitAccession, and not just use the unit ontology, e.g. UO:0000221?

<FragmentTolerance>
<PlusValue  unitAccession="PI:xxxxx" unitName="Da" value="0.5" />

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 1:47

GoogleCodeExporter commented 9 years ago
We should define which CV terms can be used with a value. The semantic validator should then test that only the terms that are allowed a value are used with one.

The psi-ms ontology does this in the following way:
xref: value-type:xsd\:float "The allowed value-type for this CV term."
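
A rough Python sketch of how a validator could use that xref. It assumes that a term without a `value-type` xref takes no value; the function names and the sample stanzas (copied from elsewhere in this thread) are only for illustration, not the real validator.

```python
import re

# Two stanzas taken from this thread; a raw string keeps the OBO "\:" escape.
OBO_TEXT = r'''
[Term]
id: PI:00154
name: sequest:probability
xref: value-type:xsd\:decimal "The allowed value-type for this CV term."
is_a: PI:00153 ! search engine specific score

[Term]
id: PI:00211
name: mass type setting monoisotopic
is_a: PI:00210 ! mass type settings
'''

VALUE_TYPE = re.compile(r"^xref:\s*value-type:xsd\\:(\w+)")


def value_types(obo_text):
    """Map term id -> declared XSD value type, or None if none is declared."""
    types = {}
    current = None
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("id:"):
            current = line.split(":", 1)[1].strip()
            types[current] = None
        elif current and (match := VALUE_TYPE.match(line)):
            types[current] = match.group(1)
    return types


def check_cv_param(accession, value, types):
    """Return an error message, or None when the usage looks allowed."""
    if accession not in types:
        return f"unknown term {accession}"
    if value is not None and types[accession] is None:
        return f"{accession} does not allow a value"
    return None
```

For example, `check_cv_param("PI:00211", "x", value_types(OBO_TEXT))` reports an error, while a value on PI:00154 passes. Checking that the value actually parses as the declared XSD type would be a natural next step.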

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 1:48

GoogleCodeExporter commented 9 years ago
Where do we get a NEWT ontology (taxonomy specification) file from? 

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 1:53

GoogleCodeExporter commented 9 years ago
Here (attached) is a list of CV terms which are not mapped to the schema at the moment (along with their parents). Just tell me by email where to put specific CV terms (subtrees of the ontology), and I'll fix the mapping file.

Original comment by a.bertsc...@googlemail.com on 5 Nov 2008 at 5:20

Attachments:

GoogleCodeExporter commented 9 years ago
(Regarding comment 12 above.) I have checked with Richard whether OLS uses an OBO file of NEWT / NCBI taxonomy for loading. Unfortunately this is not the case, so we will need to look elsewhere for a solution.

Original comment by philip.j...@gmail.com on 5 Nov 2008 at 5:54

GoogleCodeExporter commented 9 years ago
Unused cvParam locations (mapping cannot be defined, because we have no valid terms at the moment)

/psi-pi:AnalysisXML/psi-pi:SequenceCollection/psi-pi:Peptide/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:SequenceCollection/psi-pi:Peptide/psi-pi:Modification/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:SequenceCollection/psi-pi:Peptide/psi-pi:SubstitutionModification/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:AnalysisCollection/psi-pi:SpectrumIdentification/psi-pi:_runtimeParams/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:AnalysisCollection/psi-pi:ProteinDetection/psi-pi:_analysisParams/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:AnalysisProtocolCollection/psi-pi:SpectrumIdentificationProtocol/psi-pi:ModificationParams/psi-pi:SearchModification/psi-pi:ModName
/psi-pi:AnalysisXML/psi-pi:AnalysisProtocolCollection/psi-pi:SpectrumIdentificationProtocol/psi-pi:MassTable/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:AnalysisProtocolCollection/psi-pi:SpectrumIdentificationProtocol/psi-pi:MassTable/psi-pi:AmbiguousResidue/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:Inputs/psi-pi:SearchDatabase/pf:fileFormat/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:AnalysisData/psi-pi:SpectrumIdentificationList/psi-pi:SpectrumIdentificationResult/psi-pi:SpectrumIdentificationItem/psi-pi:PeptideEvidence/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:AnalysisData/psi-pi:SpectrumIdentificationList[3]/psi-pi:SpectrumIdentificationResult[4]/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:AnalysisData/psi-pi:SpectrumIdentificationList[5]/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:AnalysisData/psi-pi:ProteinDetectionList/pf:cvParam/@accession
/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:AnalysisData/psi-pi:ProteinDetectionList/psi-pi:ProteinAmbiguityGroup/pf:cvParam/@accession

Original comment by a.bertsc...@googlemail.com on 11 Nov 2008 at 12:50

GoogleCodeExporter commented 9 years ago
http://psi-pi.googlecode.com/svn/trunk/cv/axml-mapping.html is an HTML page which contains all mapping rules and the CV terms which can be used. This may help to correct the example instance documents; validation errors for the examples will follow.

Original comment by a.bertsc...@googlemail.com on 11 Nov 2008 at 12:55

GoogleCodeExporter commented 9 years ago
Please find attached the output of the semantic validator for the example instance documents. Some errors might be due to missing CV terms or missing mapping rules.

Original comment by a.bertsc...@googlemail.com on 11 Nov 2008 at 12:58

Attachments:

GoogleCodeExporter commented 9 years ago
update list of unused cv terms

Original comment by a.bertsc...@googlemail.com on 12 Nov 2008 at 4:28

Attachments:

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
In the CV there is a "search engine specific score" branch containing e.g.
"mascot:expectation value" and others.

It is good to have the search engine specific scores under one parent term, but most of them should (additionally) be child terms of "peptide result information" or "protein result information". I think that would improve the validator mapping.

We can give them multiple parents...

Original comment by eisena...@googlemail.com on 19 Nov 2008 at 4:20

GoogleCodeExporter commented 9 years ago
For the mapping file, PI:00088 needs to be allowed under: SequenceCollection/DBSequence

We seem to have two sets of fragment types. For example

[Term]
id: PI:00220
name: frag: y ion
is_a: PI:00221 ! fragmentation information

[Term]
id: PI:00262
name: param: y ion
is_a: PI:00066 ! ions series considered in search

Maybe this is OK, but maybe we can just have one lot?

For the params, also need:
       "TODO: Need CV terms for a-NH3 and also a - NH3 if a significant and fragment includes RKNQ";
       "TODO: Need CV terms for a-H2O and a - H2O if a significant and fragment includes STED";
       "TODO: Need CV terms for b-NH3 and also b - NH3 if b significant and fragment includes RKNQ";
       "TODO: Need CV terms for b-H2O and b - H2O if b significant and fragment includes STED";
       "TODO: Need CV terms for y - NH3 and also y - NH3 if y significant and fragment includes RKNQ";
       "TODO: Need CV terms for y - H2O and also y - H2O if y significant and fragment includes STED";
       "TODO: Need CV terms for internal yb";
       "TODO: Need CV terms for z+1 series";
       "TODO: Need CV terms for z+2 series";
I think that this one shouldn't require any value: 
id: PI:00020
name: DB filter taxonomy
def: "The taxonomy filter applied (if any) to the database search." [PSI:PI]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: PI:00019 ! database filtering

See the relevant section of:
http://code.google.com/p/psi-pi/wiki/NotesForDocumentation

Original comment by dcre...@gmail.com on 19 Nov 2008 at 7:40

GoogleCodeExporter commented 9 years ago
For parameters that require units, we should follow the PSI-MS structure as follows, linking explicitly to the Unit CV:

[Term]
id: MS:1000004
name: sample mass
def: "Total mass of sample used." [PSI:MS]
xref: value-type:xsd\:float "The allowed value-type for this CV term."
is_a: MS:1000548 ! sample attribute
relationship: has_units UO:0000002 ! mass unit

Original comment by andrewro...@googlemail.com on 20 Nov 2008 at 10:00

GoogleCodeExporter commented 9 years ago
Some of the XSD data types look to be incorrect (e.g. see below); several instances of xsd:decimal should be xsd:float or xsd:double.

[Term]
id: PI:00154
name: sequest:probability
def: "The SEQUEST result 'Probability'." [PSI:PI]
xref: value-type:xsd\:decimal "The allowed value-type for this CV term."
is_a: PI:00153 ! search engine specific score

Original comment by andrewro...@googlemail.com on 20 Nov 2008 at 10:42

GoogleCodeExporter commented 9 years ago
The CV needs a version number, e.g. added as a "remark" at the top of the file, following the convention of PSI-MS. The spec doc (copied from mzML) states:

A new psi-pi.obo should then be released by updating the file on the CVS server without changing the name of the file (this would alter the propagation of the file to the OBO website and to other ontology services that rely on a stable file URI). For this reason an internal version number with two decimals (x.y.z) should be increased:
• x should be increased when a first-level term is renamed, added, deleted or rearranged in the structure. Such rearrangement is supposed to be rare and is very likely to have repercussions on the mapping.
• y should be increased when any other term except a first-level one is altered.
• z should be increased when there is no term addition or deletion but just editing of the definitions or other minor changes.
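
The rule above could be sketched in Python as follows. The change labels are made up for this example, and because the quoted text does not say whether the lower components reset when a higher one is increased, this sketch leaves them untouched.

```python
# Hypothetical change labels mapped to the position they bump in "x.y.z".
CHANGE_TO_POSITION = {
    "first-level-term": 0,  # first-level term renamed, added, deleted, rearranged
    "other-term": 1,        # any other term altered
    "minor-edit": 2,        # definition edits or other minor changes
}


def bump_version(version: str, change: str) -> str:
    """Increase the component of an x.y.z version that matches the change."""
    parts = [int(part) for part in version.split(".")]
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {version!r}")
    parts[CHANGE_TO_POSITION[change]] += 1
    return ".".join(str(part) for part in parts)


if __name__ == "__main__":
    print(bump_version("1.2.3", "other-term"))
```

The resulting string would then go into the "remark" line at the top of the OBO file.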

Original comment by andrewro...@googlemail.com on 20 Nov 2008 at 10:54

GoogleCodeExporter commented 9 years ago
For the mapping file:

Error: CV term used in invalid element: 'UO:0000187 - percent' at element '/AnalysisXML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/ParentTolerance'
Error: Value of CVTerm not allowed: 'UO:0000187 - percent, value=0.1' at element '/AnalysisXML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/ParentTolerance'
Error: Value of CVTerm not allowed: 'UO:0000221 - dalton, value=0.5' at element '/AnalysisXML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/FragmentTolerance'

Original comment by dcre...@gmail.com on 20 Nov 2008 at 2:54

GoogleCodeExporter commented 9 years ago
Please add

[Term]
id: PI:0???
name: text file
is_a: PI:00043 ! input data type

For a simple text file of 
  m/z [intensity] 
values for a PMF (or single MS-MS?) search

Original comment by dcre...@gmail.com on 20 Nov 2008 at 3:10

GoogleCodeExporter commented 9 years ago
From phone conf 11/20:
Change Quality estimate score "by eye" term to "manual validation" or something like that.

Term needed for "number of matched/unmatched peaks"?

Original comment by delag...@gmail.com on 20 Nov 2008 at 4:54

GoogleCodeExporter commented 9 years ago
Edited OBO to fulfill comments 20-27.

TODOs:
- NEWT.obo (Phil?)
- comment 22 (units like in PSI-MS)
- comment 25 (problem with values in validation)
- comment 27 (terms for matched/unmatched peaks: we have terms "number of peaks matched", "number of peaks submitted", "number of peaks used"; enough?)

Original comment by eisena...@googlemail.com on 21 Nov 2008 at 2:36

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
An action was defined at the telecon on November 12th to find where in the schema search statistics should be placed.

This is already solved in the mapping file: terms below "search statistics" can be cvParams of SpectrumIdentificationList and ProteinDetectionList.

Original comment by eisena...@googlemail.com on 21 Nov 2008 at 3:54

GoogleCodeExporter commented 9 years ago
more TODOs:
- add 2nd parent for Paragon scores

Original comment by eisena...@googlemail.com on 21 Nov 2008 at 5:08

GoogleCodeExporter commented 9 years ago
[Term]
id: PI:00056
name: modification specificity rule
def: "The specificity rules for the modifications applied by the search engine (fixed, variable)." [PSI:PI]
is_a: PI:00055 ! modification parameters

Remove "(fixed, variable)" from the def line.
Also, remove PI:00187 and PI:00188 as agreed at the telecon on 2008-11-20.

Original comment by dcre...@gmail.com on 23 Nov 2008 at 7:57

GoogleCodeExporter commented 9 years ago
[Term]
id: PI:00146
name: param: a ion-NH3
def: "Ion a - NH3 if a significant and fragment includes RKNQ." [PSI:PI]
is_a: PI:00066 ! ions series considered in search

Some search engines will only consider a-NH3 ions if the fragment includes R, K, N or Q residues. Other search engines don't require the RKNQ residues.
Hence in http://code.google.com/p/psi-pi/issues/detail?id=42#c21 I requested:
"TODO: Need CV terms for a-NH3 and also a - NH3 if a significant and fragment includes RKNQ";
Either make an additional term for each of these, or remove the " if a significant and fragment includes RKNQ." etc. from the defline.

Original comment by dcre...@gmail.com on 23 Nov 2008 at 7:58

GoogleCodeExporter commented 9 years ago
[Term]
id: PI:00365
name: frag: internal yb ion
is_a: PI:00066 ! ions series considered in search
is_a: PI:00221 ! fragmentation information

At the telecon on 2008-11-20, we agreed to keep ions series found and ions series considered in search separate. The four new ones have each been added as a single term; should they be separated?

Original comment by dcre...@gmail.com on 23 Nov 2008 at 7:59

GoogleCodeExporter commented 9 years ago
Terms need adding for X!Tandem and OMSSA:

scoring

xtandem: expect
xtandem: hyperscore
omssa: e_value
omssa: p_value

source file format

omssa csv file
xtandem xml file

Original comment by jensie...@gmail.com on 26 Nov 2008 at 12:02

GoogleCodeExporter commented 9 years ago
Units in terms: 

We agreed to use the following structure for units in the CV (stolen from mzML):
[Term]
id: MS:1000004
name: sample mass
def: "Total mass of sample used." [PSI:MS]
xref: value-type:xsd\:float "The allowed value-type for this CV term."
is_a: MS:1000548 ! sample attribute
relationship: has_units UO:0000002 ! mass unit

An example in mzML would look like this. 

<cvParam cvRef="MS" accession="MS:1000016" name="scan time" value="5.8905"
unitCvRef="UO" unitAccession="UO:0000031" unitName="minute"/>

This would be straightforward for semantic validation: any term which has a "has_units" relationship must have a unit. However, at the moment, cvParam cannot have a unitAccession, unitName or unitCvRef attribute in the schema (PropertyValue is not included in cvParamType). Am I missing something, or do we need a schema change if we want to do it the same way as mzML, which would be my preference?
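
A minimal Python sketch of that validation rule, assuming the mzML-style unit attributes were available on cvParam. The `HAS_UNITS` table is hand-written here (in practice it would be built from the `has_units` relationships in the OBO file), and the check only looks for the presence of a unit, not whether the unit is a child of the required UO term.

```python
import xml.etree.ElementTree as ET

# Assumed lookup: accession -> UO parent named in its has_units relationship.
HAS_UNITS = {"MS:1000016": "UO:0000003"}  # scan time -> time unit

WITH_UNIT = (
    '<params><cvParam cvRef="MS" accession="MS:1000016" name="scan time" '
    'value="5.8905" unitCvRef="UO" unitAccession="UO:0000031" '
    'unitName="minute"/></params>'
)
WITHOUT_UNIT = (
    '<params><cvParam cvRef="MS" accession="MS:1000016" name="scan time" '
    'value="5.8905"/></params>'
)


def missing_units(xml_text, has_units=HAS_UNITS):
    """Return accessions of cvParams that require a unit but carry none."""
    root = ET.fromstring(xml_text)
    errors = []
    for param in root.iter("cvParam"):
        accession = param.get("accession")
        if accession in has_units and not param.get("unitAccession"):
            errors.append(accession)
    return errors


if __name__ == "__main__":
    print(missing_units(WITH_UNIT), missing_units(WITHOUT_UNIT))
```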

Original comment by a.bertsc...@googlemail.com on 26 Nov 2008 at 1:51

GoogleCodeExporter commented 9 years ago
Under <SpectrumIdentificationResult> I would like additional CV terms to help determine which spectrum in an MGF file the <SpectrumIdentificationResult> came from. As discussed previously, the SpectrumID parameter is sufficient for mzML, but there are several possible indexes for an MGF file (depending on how the file was created). I suggest the following 4 CV terms:
  mgf title
  mgf scans
  mgf rtinseconds
  mgf rawscans

Original comment by dcre...@gmail.com on 27 Nov 2008 at 1:58

GoogleCodeExporter commented 9 years ago
The extra mgf metadata seems like a good reason to go with nativeID instead of ID. In mzML, the ID is totally arbitrary, but the nativeID is not. So if you're working from a non-mzML file, it's perfectly reasonable to use nativeID but not really ID. The basic nativeID for an MGF is the 0-based index into the file. If the title attribute has been written in a way that the reader can parse back to a vendor's nativeID, that's a sensible alternative.

The other attributes are pretty messy IMO because they're either not required to be unique or they may encode scans or RTs from multiple acquisitions. I suggest them as userParams.
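
To make the options concrete, here is a small Python sketch that walks an MGF file and records the 0-based index of each spectrum together with whichever of the optional TITLE, SCANS, RTINSECONDS and RAWSCANS fields are present. The function name and sample data are made up for illustration; this is not how any particular search engine reads MGF.

```python
WANTED = {"TITLE", "SCANS", "RTINSECONDS", "RAWSCANS"}

# Invented sample: the second spectrum carries none of the optional fields.
SAMPLE_MGF = """\
BEGIN IONS
TITLE=run1.scan=12
RTINSECONDS=301.5
PEPMASS=445.12
100.1 2000
END IONS
BEGIN IONS
PEPMASS=512.30
150.2 900
END IONS
"""


def index_mgf(lines):
    """Return one dict per spectrum: 0-based index plus optional identifiers."""
    entries = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"index": len(entries)}
        elif line == "END IONS" and current is not None:
            entries.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split once only, because a title may itself contain "=".
            key, value = line.split("=", 1)
            if key.upper() in WANTED:
                current[key.lower()] = value
    return entries


if __name__ == "__main__":
    for entry in index_mgf(SAMPLE_MGF.splitlines()):
        print(entry)
```

The second entry shows the problem case: with no title, scans or retention time, the index is the only identifier left.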

Original comment by matthew....@vanderbilt.edu on 27 Nov 2008 at 2:19

GoogleCodeExporter commented 9 years ago
Interesting points Matt, and useful to have feedback from your mzML experience.
When the input to the search is an mzML file, our spectrumID attribute is the mzML spectrum 'id'. This is 'easy' and a majority of people agreed at an earlier meeting that if you want further information like retention time, you need to go back to the mzML file.

When the input to the search engine is an mgf file, things are not so easy, because different people use the title, scans and rtinseconds fields in different ways. Also, as you say, there is no guarantee that any of these are unique. In a case where someone has provided, say, the rtinseconds but not a title, it would be useful to report this and to make it clear which of the possible values is being reported.

Using a zero-based index into the MGF isn't an option for the general purpose program that takes a Mascot (.dat) results file and converts it to an analysisXML file, because it doesn't have the mgf file and doesn't know what the offset is.

btw, in case it's not clear, we don't currently have a nativeID attribute for the <SpectrumIdentificationResult>

A common use case might be that someone has an analysisXML document originating from an mgf search and thinks a result looks 'interesting'. They then want to go back to the original 'raw' data to look at it. Ideally, this should take as few steps as possible. The only safe spectrumID value for the Mascot converter is the Mascot query number (this is not what the examples use at the moment). So, the user needs the Mascot (.dat) results file to then find the title/scan/rtinseconds and from that can determine the scan number in the raw data. Seems like a long way round to me and requires that they also have the .dat file.

We are trying to not use userParams too much in analysisXML because we are keen to make the most of the cv validation tools.

So, I realise it's far from ideal, but I think what I'm proposing makes the best of a far from ideal situation. Or maybe I'm missing something?

Original comment by dcre...@gmail.com on 27 Nov 2008 at 3:16

GoogleCodeExporter commented 9 years ago
changed obo according to comments 32-35
(for comment 33 I removed the " if a significant and fragment includes RKNQ." etc. from the defline)
(for comment 27b I added a term "number of unmatched peaks")

TODOs left:
- Newt.obo
- add 2nd parent for Paragon scores (Sean)
- decide comments 37 / 38

Original comment by eisena...@googlemail.com on 27 Nov 2008 at 3:25

GoogleCodeExporter commented 9 years ago
With OBOEdit 1.101 I cannot edit the obo file (OBOEdit does not show its GUI but the process is there and locks the file).

The reason is the two
relationship: has_units: UO:0000XXX ! name
lines added in revision 273.

I had to delete them in a text editor, edit with OBOEdit, then add them again with a text editor :-(

Original comment by eisena...@googlemail.com on 27 Nov 2008 at 3:29

GoogleCodeExporter commented 9 years ago
Hi David, sorry for the long reply to follow...

RE: spectrumID reference
I'm aware of the decision to use mzML's id as the spectrumID but I'm bringing the point back up because the issue of non-mzML inputs was not discussed at the time (AFAIK). I do not see the justification for using the id instead of the nativeID when the latter must always exist for any input format whereas the former only makes sense from an actual mzML file.

RE: MGF ids
Having CV terms for various format attributes is not a terrible thing, but I worry because the scope is potentially much bigger than MGF->DAT->analysisXML. All of the non-mzML input formats that could potentially be used to generate an intermediate search result format and then converted to analysisXML will more often than not have this problem. Trying to account for the various transformations of the identifiers that could happen from this translation seems like a lost cause to me. The exception would be very specific pipelines where the inputs and outputs are tightly controlled and in those cases, userParams seem more appropriate than cvParams. Even in the case of MGF->DAT->analysisXML, some of your MGF inputs may be completely lacking in title, rt, and scan attributes, because they're all optional, so without an index it's all screwed! :(

Just think of the combinations:
modern vendor formats: Thermo RAW, Waters RAW, WIFF, YEP, BAF, FID, MassHunter, Shimadzu
open formats: mzML, mzXML, mzData, MGF, DTA, MS[12], PKL
search result formats: pepXML, SQT, OUT, SRF, DAT, X! Tandem

As I understand it, your specific use case is: take existing DAT files that were searched from MGFs with (unique?) title/RT/scan attributes and convert to analysisXML in a way that a generic reader can directly go back to the MGF data.

The generic version of that use case is: take existing search results in any format that were searched from any spectra format and convert to analysisXML in a way that a generic reader can directly go back to the data in the input spectra format.

Supporting the specific use case and not the generic one makes me cringe a bit, which is why I chimed in on the issue. Can't users just re-search their data and output directly to analysisXML with the index attribute intact? :P

Original comment by matthew....@vanderbilt.edu on 27 Nov 2008 at 10:07

GoogleCodeExporter commented 9 years ago
Hi Matt,

Surprised to get any reply from you at all yesterday ;)
You are right, we kind of side stepped the issue of non mzML/mzData input formats at the time, so it's important to hammer it out now. And yes, you are quite right, we should try and support things as generally as possible. Incidentally, my proposal for the CV wasn't quite as narrow as you suggest: MGF->DAT->analysisXML, it could be any engine or format (or no intermediate file) in the middle.

I actually had a slightly different use case in mind - it wasn't for a generic automated pipeline (which as you say is impossible to do reliably), but more for 'manual' inspection. So, if someone sees something 'interesting' they stand a chance of finding the original data manually with as few intermediate files as possible.
However, you've got me thinking... suppose someone was writing a pipeline. They had MGF files consistently generated by software 'X' and they were using 3 different search engines that output analysisXML files. They would surely rather that the identifiers in the analysisXML files were of a consistent format using CV rather than differing format using user params?

I guess I'm not keen on the nativeID idea for the MGF because I couldn't see how it could be implemented retrospectively without the MGF files. Requiring that people re-search their data seems a little harsh. Also, there's bound to be a time period before mzML is widely supported.
For .pkl and concatenated .dta files, there is no option, so I've not proposed any CV. For single dta files, there is no need for CV because it's multiple files and we have the filename. Excuse my ignorance, but what's MS[12]? Is there a guaranteed unique ID for mzXML?

David

Original comment by dcre...@gmail.com on 28 Nov 2008 at 9:31

GoogleCodeExporter commented 9 years ago
We specify the source file from which the AnalysisXML file was created, and the Spectra_Data location.

For our MPC use case both are URIs to database locations, which I created like this:
<SourceFile id="SF1" location="proteinscape://www.medizinisches-proteom-center.de/PSServer/Project/Sample/Separation_1D_LC/Fraction_X/SpectraData/Results1"/>
<SpectraData id="LCMALDI_spectra" location="proteinscape://www.medizinisches-proteom-center.de/PSServer/Project/Sample/Separation_1D_LC/Fraction_X"/>

1.) Was it intended like that?
2.) I suggest renaming <SourceFile> to <SourceData> (in obo, too).

Original comment by eisena...@googlemail.com on 28 Nov 2008 at 1:48

GoogleCodeExporter commented 9 years ago
OK, I can see the argument of skipping the MGF and going straight to the native spectrum or spectra in a controlled manner, but I'll make a slightly more generic proposal. For example, many input formats may provide retention time information as an alternative way of identifying a scan, so that should be a generic concept. We already have it in the PSI-MS CV of course: the "scan time" term.

So...now we are back to the original discussion about including mzML attributes in analysisXML (the other MGF attributes also probably belong in the MS CV, perhaps under an "alternative identifier" category). I'm not sure if your use case came up at the time though - I seem to recall it was mainly considered as a way of forwarding commonly-used attributes and not as an alternative identifier. So I would support the alternative identifier approach for TITLE, SCANS, and RAWSCANS (I could not find any documentation on the last one?), but not the retention time(s). That should re-use the existing term, multiple times if necessary.

Also consider the use case of running a search straight from a native format. In such a case, the nativeID is well defined (and can be adopted now, without using mzML as an intermediate if that is not yet desired), the spectrumID is not. I think this use case is perfectly legitimate too, we do it often with MyriMatch which can read whatever formats pwiz can (currently Thermo RAW w/ Xcalibur, Waters RAW w/ MassLynx, Bruker/Agilent YEP, Bruker BAF/FID). And when we need to go back and view a spectrum in either the raw data or in an associated mzML, the best bet is to use the nativeID because it's well defined in either case.

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 2:18

GoogleCodeExporter commented 9 years ago
MS2 is essentially concatenated DTAs with added metadata to avoid the problem you mentioned about concatenated DTAs and PKLs. :) MS1 is equivalent for MS1 data.
mzXML's scan attribute (xsd:nonZeroInteger) is required to be unique but obviously it can't always be used to track down the original nativeID easily. There is a nativeScanOrigin element meant to be able to do that, but it's not used frequently.

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 2:22

GoogleCodeExporter commented 9 years ago
MPC_example.axml is an example using a protein decoy approach.
It lists ALL proteins of a ProteinDetectionAnalysis and reports the "local FDR" for each protein in the score-sorted list.
[BTW: In the CV we have also "local FDR" for peptides and "pep:global FDR" and "prot:global FDR", all as result values of an analysis.]

We need an INPUT parameter "prot:FDR threshold" or "pep:FDR threshold" (probably in the branch "search input details"/"quality estimation method"), if we want to report only the proteins below a specified FDR.
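
As a rough illustration of what such a threshold parameter would control, the Python sketch below filters a score-sorted protein list using a simple running target-decoy estimate (decoys so far divided by targets so far). This estimate is an assumption chosen for the example and is not necessarily how "local FDR" is calculated in MPC_example.axml; the function name and data are hypothetical.

```python
def filter_by_fdr(hits, threshold):
    """Return target accessions in the longest ranked prefix within threshold.

    hits: iterable of (accession, score, is_decoy); higher score is better.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    targets = decoys = 0
    best = 0  # length of the longest prefix whose estimate is acceptable
    for position, (_accession, _score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= threshold:
            best = position
    return [acc for acc, _score, is_decoy in ranked[:best] if not is_decoy]


if __name__ == "__main__":
    example = [
        ("P1", 90.0, False),
        ("P2", 80.0, False),
        ("D1", 70.0, True),
        ("P3", 60.0, False),
        ("D2", 50.0, True),
    ]
    print(filter_by_fdr(example, threshold=0.4))
```

Recording the threshold as an input cvParam would let a reader tell a filtered list like this apart from a complete one.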

Original comment by eisena...@googlemail.com on 28 Nov 2008 at 3:11

GoogleCodeExporter commented 9 years ago
OK, I assume that we can now agree on adding:
  mgf title
  mgf scans
  mgf rawscans
(The rawscans will be in the next release of Mascot, so isn't documented yet)

Yes, I totally agree that the retention time (possibly multiple times as you say) should be a generic term.
For better or worse, we decided after much discussion not to share CV with mzML, so we'd need to have our own term for retention time. In fact, it looks as though we already have PI:00114 (although this can't currently be used because it's not in the mapping file). Even though we aren't sharing CV, we should at least use the same name and description as in mzML:

[Term]
id: MS:1000016
name: scan time
def: "The time that an analyzer started a scan, relative to the start of the run." [PSI:MS]
xref: value-type:xsd\:float "The allowed value-type for this CV term."
is_a: MS:1000503 ! scan attribute
relationship: has_units UO:0000003 ! time unit

Any objections?

We'd not considered the use case of running a search straight from a native format... However, are you doing some sort of peak detection? Are you merging together any spectra, i.e. will there be multiple nativeIDs for each spectrum? Might the same nativeID be used for multiple spectra? (I notice that nativeIDs don't have to be unique in mzML.)

-David

Original comment by dcre...@gmail.com on 28 Nov 2008 at 3:48

GoogleCodeExporter commented 9 years ago
Can you point me to the discussion about (not) sharing CV? That seems a bit crazy to me (and contrary to the PSI CV guidelines?). I'm sure there are reasons though, I just want to see them. :)

All of these terms are also things that would potentially be in an mzML file created from MGF (just like the Thermo filter line may be included from Thermo files), so that's why I suggested they all go in the MS CV. MGF is after all a generic MS format, not necessarily specific to proteomics even. :)

NativeIDs in mzML must be unique. You just had to bring up merged spectra didn't you? ;) It gets pretty painful and hazy when the original acquisitions and their merged forms are kept in the same file. There are 2 issues there:
1) support representing both the merged spectra and the separate acquisitions as independent spectra? or only support one or the other
2) if yes to 1, and nativeID must be unique, there are several possible solutions:
  a) just taking the first acquisition's nativeID won't be unique, so we extend the nativeID syntax to support either ranges (Thermo: "controller=0 scan=[2,10]") or lists of nativeIDs ("controller=0 scan=2,controller=0 scan=15,controller=0 scan=50") or perhaps a combination of both
  b) use a special convention for nativeIDs of merged spectra that indicates to a semantic validator that the nativeID is irrelevant and only the acquisitionList is important; e.g. nativeID="merged" (since nativeID is string and not xsd:ID, it won't be invalid syntax)

Really there's no nativeID for a merged spectrum, so anything we come up with is a workaround.

Finally, several vendor formats allow peak picking straight out of their API, namely Thermo, ABI, and Bruker. So for these formats MyriMatch works straight off by just asking for the centroids. For other formats, we don't have an external peak picker yet (in ProteoWizard) but we will "Real Soon Now." And no, when reading straight from the vendor file we don't merge, so nativeIDs are direct.

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 4:27

GoogleCodeExporter commented 9 years ago
Um. I can't find the relevant minutes that describe why we aren't importing the MS CV... I recollect that part of the discussion was along the lines that the structure would be different between the two and this could become a logistical nightmare. Also, that the mzML CV is not yet stable and trying to get the two to work together in a timely manner wasn't considered to be feasible. I'm no expert with CV, so am probably not the best person to answer this. Guess we (both groups) may have to defend this decision.

btw, there's no constraint for nativeID being unique in the mzML schema so I assumed it didn't have to be unique.

I think this is starting to get a little beyond the scope of analysisXML. 'All' we want is to be able to say which spectrum a result relates to, and we can realistically only report back whatever is fed into the search engine.

Is the following 'good enough' for all cases (even if we aren't 100% happy with it?):

The spectrumID attribute in analysisXML instance documents must be unique.

spectrumID: for mzML files, must be the <spectrum id>, which is enforced as unique in the mzML schema
            for mzData files, the <spectrum id> value; should be unique, but not enforced in the mzData schema
            for mzXML, the scan attribute; should be unique, but not enforced in the mzXML schema?
            for other files, any unique value, possibly generated by the search engine.

Add the following optional CV terms:

  scan time  (maxOccurs="unbounded" for merged spectra)
  nativeID   (maxOccurs="unbounded" for merged spectra)
  mgf title
  mgf scans
  mgf rawscans

So MyriMatch would presumably report the nativeID.
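
Whatever source the spectrumID comes from, the uniqueness requirement itself is easy to check. A small Python sketch, using a made-up namespace-free fragment (real analysisXML elements live in the psi-pi namespace, and the `query=N` values are only an example of a search-engine-generated ID):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; the third result repeats a spectrumID on purpose.
EXAMPLE = """
<SpectrumIdentificationList>
  <SpectrumIdentificationResult id="SIR_1" spectrumID="query=1"/>
  <SpectrumIdentificationResult id="SIR_2" spectrumID="query=2"/>
  <SpectrumIdentificationResult id="SIR_3" spectrumID="query=2"/>
</SpectrumIdentificationList>
"""


def duplicate_spectrum_ids(xml_text):
    """Return the spectrumID values that occur more than once, sorted."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        result.get("spectrumID")
        for result in root.iter("SpectrumIdentificationResult")
    )
    return sorted(sid for sid, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(duplicate_spectrum_ids(EXAMPLE))
```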

Does this sound reasonable?

Original comment by dcre...@gmail.com on 28 Nov 2008 at 5:42