Encoding spectrum identifiers for MGF, PKL and DTA files

GoogleCodeExporter commented 9 years ago

Section 5.1.4 of the specification states:

A <SpectrumIdentificationResult> is linked to the source spectrum (in an
external file) from which the identifications are made by way of a
reference in the spectrumID attribute and via the <SpectraData> element
which stores the URL of the file in the location attribute. It is
advantageous if there is a consistent system for identifying spectra in
different file formats. The following table is implemented in the PSI-MS CV
for providing consistent identifiers for different spectrum file formats:

...

MS:1000774  multiple peak list nativeID format: 
index=xsd:nonNegativeInteger    
Used for conversion of peak list files with multiple spectra, i.e. MGF,
PKL, merged DTA files. Index is the spectrum number in the file, starting
from 0.

From the mzML doc for <spectrum id="">:
The native identifier for a spectrum. For unmerged native spectra or
spectra from older open file formats, the format of the identifier is
defined in the PSI-MS CV and referred to in the mzML header. External
documents may use this identifier together with the mzML filename or
accession to reference a particular spectrum.
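
To make the quoted format concrete, here is a minimal sketch of how an importer might read an "index=xsd:nonNegativeInteger" identifier. The function names are hypothetical and not part of any PSI library; the only thing taken from the specification text above is the key=value form and the non-negative constraint.

```python
def parse_native_id(native_id):
    """Split a nativeID such as 'index=0' into a dict of its key=value fields."""
    fields = {}
    for token in native_id.split():
        key, sep, value = token.partition("=")
        if not (sep and key and value):
            raise ValueError(f"not a key=value token: {token!r}")
        fields[key] = value
    return fields


def multiple_peak_list_index(native_id):
    """Return the spectrum index for the 'index=N' form (MS:1000774)."""
    fields = parse_native_id(native_id)
    if set(fields) != {"index"}:
        raise ValueError(f"expected a single index=N field, got {native_id!r}")
    index = int(fields["index"])
    if index < 0:
        # xsd:nonNegativeInteger excludes negative values
        raise ValueError("index must be non-negative")
    return index
```

With this sketch, "index=0" refers to the first spectrum in an MGF, PKL or merged DTA file.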

As discussed (at length!) here: 
http://code.google.com/p/psi-pi/issues/detail?id=42#c37
it's not currently possible to write a converter for Mascot (and possibly
other search engines) that outputs this ID. Future releases of Mascot could
support this, but that won't help for older Mascot searches or other engines.
MGF files optionally contain other tags that could be used as described in
the reference above.

btw, we don't currently have MS:1000774 or any related term in any of the
examples, which will make it pretty difficult for any importer to make
references back to the original raw data. So, I think that the mapping
needs to be changed to require a child of MS:1000767 at the
SpectraData/fileFormat level.

For reference:

[Term]
id: MS:1000767
name: native spectrum identifier format
def: "Describes how the native spectrum identifiers are formated." [PSI:MS]
synonym: "nativeID format" EXACT []
relationship: part_of MS:1000577 ! raw data file

If this is the case, we could (for Mascot) add another term:

[Term]
id: MS:100xxxxx
name: Mascot Query number 
def: "index=xsd:nonNegativeInteger" [PSI:MS]
comment: The spectrum number in a Mascot results file, starting from 1.
is_a: MS:1000767 ! native spectrum identifier format

So, this would look like:

<SpectraData location="file:///est_coding_test.mgf" id="SD_1">
  <fileFormat>
    <cvParam accession="MS:1001062" name="Mascot MGF file" cvRef="PSI-MS" />
    <cvParam accession="MS:100xxxxx" name="Mascot Query number" cvRef="PSI-MS" />
  </fileFormat>
</SpectraData>

And additionally, we can have:
  <SpectrumIdentificationResult id="query=1">
    <cvParam accession="MS:1000796" name="spectrum title" value="..." cvRef="PSI-MS" />
    <cvParam accession="MS:1000797" name="peak list scans" value="..." cvRef="PSI-MS" />
    <cvParam accession="MS:1000798" name="peak list raw scans" value="..." cvRef="PSI-MS" />
    <cvParam accession="MS:1001114" name="retention time(s)" value="..." cvRef="PSI-MS" />

(These 4 CV items are already present)

Trouble is, I can already hear howls of complaint from the mzML group with
the "query number" CV term.
Also, we need to determine what the other search engines can do and if a
term is required for them.

There's obviously no problem with the documentation or example for ids when
the input file was an mzML file.

Original issue reported on code.google.com by dcre...@gmail.com on 13 Aug 2009 at 12:07

GoogleCodeExporter commented 9 years ago

It doesn't work like that, because only one cvParam is allowed under
<fileFormat>.

Isn't the spectrumID type CV term better placed as a cvParam of
SpectrumIdentificationResult?
(That would also allow a mix of several spectrum file types.)

Then only the mapping file would have to be changed ;-)

Original comment by eisena...@googlemail.com on 13 Aug 2009 at 4:31

Added labels: Milestone-Release1.0

GoogleCodeExporter commented 9 years ago

Isn't it too much bloat to have to repeat it for each spectrum?
If there are multiple files, there need to be multiple <SpectraData> items and
therefore multiple <fileFormat> elements.
I can't see a justification for using different indexes within the same file.

Original comment by dcre...@gmail.com on 13 Aug 2009 at 5:20


GoogleCodeExporter commented 9 years ago

I fully agree regarding the "bloat"!

We could allow more cvParams (maxOccurs="unbounded") under
SpectraData/fileFormat, but that would be a schema change in FuGElight...

Original comment by eisena...@googlemail.com on 13 Aug 2009 at 6:12


GoogleCodeExporter commented 9 years ago

To pick at an old scab, I'll quote the other issue:

> Using a zero based index into the MGF isn't an option for the general
> purpose program that takes a Mascot (.dat) results file and converts it to
> an analysisXML file because it doesn't have the mgf file and doesn't know
> what the offset is.

> A common use case might be that someone has an analysisXML document
> originating from an mgf search and thinks a result looks 'interesting'. They
> then want to go back to the original 'raw' data to look at it. Ideally, this
> should take as few steps as possible. The only safe spectrumID value for the
> Mascot converter is the Mascot query number (this is not what the examples
> use at the moment). So, the user needs the Mascot (.dat) results file to
> then find the title/scan/rtinseconds and from that can determine the scan
> number in the raw data. Seems like a long way round to me and requires that
> they also have the .dat file.

Is it not possible for the DAT converter to access the input MGF and translate
from query number to MGF index automagically? Somebody or something must have
access to the MGF or else an identifier for the spectrum is rather useless.

Or are you more concerned with trying to map back from the mzIdentML to the DAT
query? I can understand in that case that once you've done the automagic
conversion from query # to MGF index, it would be tricky to get back to the
query number. This seems analogous to WIFF->mzXML conversion losing the
critical cycle and experiment numbers which is why we gave mzXML the "scan
number only" nativeID format. I'm not sure it makes sense to do that for
identifiers which don't store the raw spectral data; how would it apply to
other intermediate result formats, e.g. SRF, X! Tandem, SQT, OUT, pepXML? I
don't know about SRF, but all the others suffer the same problem as DAT
without the aforementioned automagic conversion (usually by parsing the
filename/attribute/TITLE a certain way).
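
As a rough illustration of the "automagic" translation, the sketch below indexes an MGF by its TITLE lines so that a title recovered from a results file can be turned into a zero-based MGF index. It assumes the converter can still read the MGF and that titles are unique within it; neither is guaranteed. The function name and the sample peak list are hypothetical.

```python
def index_mgf_titles(lines):
    """Map each TITLE value to the zero-based position of its BEGIN IONS block."""
    titles = {}
    index = -1
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            index += 1
            in_block = True
        elif line == "END IONS":
            in_block = False
        elif in_block and line.startswith("TITLE="):
            # A repeated title would overwrite the earlier entry.
            titles[line[len("TITLE="):]] = index
    return titles


# Hypothetical two-spectrum peak list, for illustration only.
SAMPLE_MGF = """\
BEGIN IONS
TITLE=scan 10
PEPMASS=500.25
100.1 5
END IONS
BEGIN IONS
TITLE=scan 11
PEPMASS=612.30
200.2 7
END IONS
"""

title_to_index = index_mgf_titles(SAMPLE_MGF.splitlines())
native_id = f"index={title_to_index['scan 11']}"
```

Spectra without a TITLE line are counted but cannot be looked up this way, which is one reason the approach does not generalise to PKL or merged DTA input.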

Original comment by matt.cha...@gmail.com on 13 Aug 2009 at 7:28


GoogleCodeExporter commented 9 years ago

> Or are you more concerned with trying to map back from the mzIdentML to the
> DAT query?
No, that wasn't something we'd really considered, but I guess it could be
useful, so thanks for mentioning it. The assumption is that you'd need the
mzIdentML file *or* the .DAT file, not both.

> Is it not possible for the DAT converter to access the input MGF and
> translate from query number to MGF index automagically? Somebody or
> something must have access to the MGF or else an identifier for the
> spectrum is rather useless.
Yes, (maybe!) but not in a general way. We want a script that can run on all
Mascot servers (including our public web site) and for searches that were
possibly done years ago. The mgf file is normally on a different computer and
may have been lost. If there's no longer an MGF file, the index is of course
useless but you may still want an mzIdentML file. For a pkl or merged dta
file, there is no reliable way to get back to generating an index.
In the next version of Mascot, we intend to save the index into the
mgf/pkl/dta file, so it shouldn't be an issue for new searches.

Original comment by dcre...@gmail.com on 14 Aug 2009 at 8:57


GoogleCodeExporter commented 9 years ago

Following the telecon yesterday, we've agreed that rather than adding a "native
spectrum identifier format" CV item to the <fileFormat> element, we should use
a new <spectrumIDFormat> element. For example:

  <SpectraData location="file:///est_coding_test.mgf" id="SD_1">
    <fileFormat>
      <cvParam accession="MS:1001062" name="Mascot MGF file" cvRef="PSI-MS" />
    </fileFormat>
    <spectrumIDFormat>
      <cvParam accession="MS:1000774" name="multiple peak list nativeID format" cvRef="PSI-MS" />
    </spectrumIDFormat>
  </SpectraData>
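
A consumer could pick up both terms from this element roughly as follows. This is only a sketch using Python's standard xml.etree.ElementTree on the fragment above; the function name is made up, and a real mzIdentML document would declare an XML namespace that the element paths here do not account for.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<SpectraData location="file:///est_coding_test.mgf" id="SD_1">
  <fileFormat>
    <cvParam accession="MS:1001062" name="Mascot MGF file" cvRef="PSI-MS" />
  </fileFormat>
  <spectrumIDFormat>
    <cvParam accession="MS:1000774" name="multiple peak list nativeID format" cvRef="PSI-MS" />
  </spectrumIDFormat>
</SpectraData>
"""


def describe_spectra_data(element):
    """Collect the file format and spectrum ID format accessions of one <SpectraData>."""
    file_format = element.find("fileFormat/cvParam")
    id_format = element.find("spectrumIDFormat/cvParam")
    return {
        "id": element.get("id"),
        "location": element.get("location"),
        "file_format": None if file_format is None else file_format.get("accession"),
        "id_format": None if id_format is None else id_format.get("accession"),
    }


info = describe_spectra_data(ET.fromstring(SNIPPET.strip()))
```

Keeping the ID format at the <SpectraData> level means it is read once per input file rather than once per spectrum.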

This requires the following addition to the mapping file:

<CvMappingRule id="SpectraDataSpectrumIDFormat_rule"
    cvElementPath="/mzIdentML/DataCollection/Inputs/SpectraData/spectrumIDFormat/cvParam/@accession"
    requirementLevel="MUST" scopePath="" cvTermsCombinationLogic="OR">
  <CvTerm termAccession="MS:1000767" useTermName="false" useTerm="false"
      termName="native spectrum identifier format"
      isRepeatable="false" allowChildren="true" cvIdentifierRef="MS" />
</CvMappingRule>

And the following additional CV items:

[Term]
id: MS:1001526
name: spectrum from database nativeID format
def: "databasekey=xsd:Long" [PSI:MS]
comment: A unique identifier of a spectrum stored in a database (e.g. a PRIMARY KEY identifier).
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001527
name: Proteinscape spectra
def: "Spectra from Bruker/Protagen Proteinscape database." [PSI:MS]
is_a: MS:1000560 ! mass spectrometer file format

[Term]
id: MS:100xxxxx
name: Mascot query number
def: "index=xsd:nonNegativeInteger" [PSI:MS]
comment: The spectrum (query) number in a Mascot results file, starting from 1.
is_a: MS:1000767 ! native spectrum identifier format
is_a: MS:1001405 ! spectrum identification result details

(Should this be 2 separate CV items?)

and add an is_a to
id: MS:1001114
name: retention time(s)
...
is_a: MS:1001405 ! spectrum identification result details
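
The mapping rule above accepts children of MS:1000767 but not the term itself (useTerm="false", allowChildren="true"). A minimal sketch of that check against OBO stanzas like the ones in this comment is shown below. The parser is deliberately tiny and hypothetical, and the excerpt assumes MS:1000774 is already a child of MS:1000767, as its use under <spectrumIDFormat> implies.

```python
OBO_EXCERPT = """\
[Term]
id: MS:1000767
name: native spectrum identifier format

[Term]
id: MS:1000774
name: multiple peak list nativeID format
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001526
name: spectrum from database nativeID format
is_a: MS:1000767 ! native spectrum identifier format

[Term]
id: MS:1001527
name: Proteinscape spectra
is_a: MS:1000560 ! mass spectrometer file format
"""


def parse_parents(obo_text):
    """Return {term id: set of direct is_a parents} from OBO [Term] stanzas."""
    parents = {}
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = None
        elif line.startswith("id: "):
            current = line[len("id: "):].strip()
            parents.setdefault(current, set())
        elif line.startswith("is_a: ") and current is not None:
            # Drop the trailing "! term name" annotation.
            parents[current].add(line[len("is_a: "):].split("!")[0].strip())
    return parents


def is_child_of(term, ancestor, parents):
    """True if `term` is a strict descendant of `ancestor` via is_a links."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


PARENTS = parse_parents(OBO_EXCERPT)
```

Under this sketch the ProteinScape file format term is rejected for <spectrumIDFormat>, while the database nativeID term is accepted.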

Original comment by dcre...@gmail.com on 14 Aug 2009 at 3:54


GoogleCodeExporter commented 9 years ago

Can you just reuse "multiple peak list nativeID format" instead of making a new
term? It's the same index=xsd:nonNegativeInteger.

Also, the database nativeIDs deserve more discussion. Are those databases actual
native sources, like the Oracle database used by ABI's 4000 and 5000 series
TOF-TOFs? Or are they importing native spectra into the database?

Original comment by matt.cha...@gmail.com on 14 Aug 2009 at 4:03


GoogleCodeExporter commented 9 years ago

To clarify, I think in retrospect some of the vendor nativeIDs that have the
same "scan=xxx" definition should probably have been made as synonyms of "scan
number only" instead of making separate terms. The same would apply to
"index=xxx" terms.

Original comment by matt.cha...@gmail.com on 14 Aug 2009 at 4:05


GoogleCodeExporter commented 9 years ago

If we use the same "multiple peak list nativeID format", then an importer can't
tell whether the index is for the query number or an index into the original
mgf/pkl/dta file.
The comment says 'i.e.' and not 'e.g.' so strictly speaking it's just
restricted to those three file types.

The database term isn't for a 4000/5000 series database. It's intended for
ProteinScape or any 'generic' rdb where spectra are stored. The format for
MS:1001480 is pretty specific to the tof-tofs.

Original comment by dcre...@gmail.com on 14 Aug 2009 at 4:36


GoogleCodeExporter commented 9 years ago

Do these non-native databases really not store the necessary information to get
back to the native spectrum? That would be surprising. If they do, I think the
thing to do is use the original nativeID. If not, then some other term should
be appropriate. A generic "database nativeID" term is silly. For one thing,
certainly not all databases will use xsd:long as a primary key! The nativeID
string would make a sensible nativeID as well (albeit slower, but still
reasonable). So for a database format that uses a single integer identifier,
maybe mzData's term "spectrum identifier" with
"spectrum=xsd:nonNegativeInteger" would be appropriate.

For the query number, what would the file format term be? It wouldn't seem
right for it to be MGF if you can't map the query number back to the MGF index.
Instead, perhaps DAT should be a valid term for file format and that could
easily indicate that the id has been rendered useless for getting back to the
raw spectrum because it's been disconnected from the nativeID?

Original comment by matt.cha...@gmail.com on 14 Aug 2009 at 4:50


GoogleCodeExporter commented 9 years ago

Someone else (Martin) may have comments about the database part. Maybe it
should be specific to ProteinScape rather than a generic database. However, I
think I agree with you because the use case here would typically be an export
from ProteinScape to someone who doesn't have that particular ProteinScape
database, but may have the peak list and/or raw data file.

For the query number, I partially agree that the file format for the
<SpectraData> should be .DAT file.
For an mgf file, there is of course a way to get back to the spectrum in the
raw data file because, in this case, we save the title, scans, retention time
etc. as separate cv items in the mzIdentML file. However, as you say, the index
itself doesn't help you get back in one go.
The typical use case will be that the consumer of the mzIdentML file may also
have the mgf/pkl/dta file and/or the raw data, but not the .dat file.
One problem with using the .dat file as the filename is that you lose the
information about the filename of the mgf/pkl/dta file. Maybe that filename
just has to be a user param? Also, we already have the name of the .dat file a
few lines above as:
  <DataCollection>
    <Inputs>
      <SourceFile location="file:///../data/F001350.dat" id="SF_1" >
        <fileFormat>
          <cvParam accession="MS:1001199" name="Mascot DAT file" cvRef="PSI-MS" />

So, I don't have too strong a feeling either way about what the fileFormat of
the <SpectraData> should be.

However, the description for "multiple peak list nativeID format" would be
rather misleading:
  Index is the spectrum number in the file, starting from 0
So I still think that there should be a new CV term for the query number even
if we change the 'i.e.' to 'e.g.'

Original comment by dcre...@gmail.com on 14 Aug 2009 at 5:59


GoogleCodeExporter commented 9 years ago

One more thing to agree on...
If the input file is an mzML file, then the fileFormat is:
MS:1000584, name: mzML file

but should the new <spectrumIDFormat> be:
MS:1000767, name: native spectrum identifier format
or the child term specified in the actual mzML file, for example
MS:1000768, name: Thermo nativeID format

Original comment by dcre...@gmail.com on 17 Aug 2009 at 9:29


GoogleCodeExporter commented 9 years ago

I think we should have a term specifically for mzML where the value is exactly
whatever is contained within the <spectrum id="[ ]" > attribute in mzML.
mzIdentML converters shouldn't need to worry about where the ID format came
from further back in the pipeline?

Original comment by andrewro...@googlemail.com on 17 Aug 2009 at 9:41


GoogleCodeExporter commented 9 years ago

I agree. So, we have 3 choices:
 1. use MS:1000767 (not especially intuitive)
 2. make a new term that is a child of MS:1000767 
 3. make a new term that is a child of ???

Problem with #2 is that it is a bit recursive and could be confusing to the
mzML people?

Original comment by dcre...@gmail.com on 17 Aug 2009 at 10:53


GoogleCodeExporter commented 9 years ago

The nativeID is transitive: transcoding through mzML does not change the
nativeID format. You should use the same nativeID format that the input
document used because that's the only reasonable way you're going to preserve
the meaning of "sample=1 period=1 cycle=123 experiment=2".

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 12:45


GoogleCodeExporter commented 9 years ago

Following this argument through would mean that we would then need to specify
the fileFormat as a .wiff file rather than the mzML file which seems crazy. We
aren't including much (any) information about how the peak list was produced
from the raw data, so you _have_ to go back to the mzML file for all that good
stuff. We just want an index into the mzML file. It could be
pigs_might_fly_01, pigs_might_fly_02 etc. for all we care!

Original comment by dcre...@gmail.com on 17 Aug 2009 at 1:12


GoogleCodeExporter commented 9 years ago

No. File format is not transitive, nativeID is. Use the nativeID, we already
had this discussion. NativeID works for all input file types;
"pigs_might_fly_02" used to work for mzML but in 1.1 there is only nativeID.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 2:35


GoogleCodeExporter commented 9 years ago

In order to be able to delete an intermediate mzML and go straight from
mzIdentML to a native file, it would be nice to have sourceFile/dataProcessing
info forwarded from the mzML to the mzIdentML, but that might be overkill.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 2:37


GoogleCodeExporter commented 9 years ago

Let me put it another way: the mzML file will have an id that, for a WIFF file,
MUST look like "sample=x period=x cycle=x experiment=x". What reason is there
to call that an "mzML" nativeID instead of a WIFF nativeID?

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 2:38


GoogleCodeExporter commented 9 years ago

In the MPC use case I cannot use the MGF "ID" because the mgf is "lost" once
the spectra data is imported into ProteinScape. (ProteinScape sends data to
search engines by creating temporary MGFs not visible to the user.)

I agree that we could use a generic database ID, as the ID itself is not
accessible to the users. I suggested the "specialised" ID as we have those
specialised IDs in the file case also (Bruker BAF, Bruker YEP).

The main argument now should be, that we need a "release" of the example files.

Original comment by eisena...@googlemail.com on 17 Aug 2009 at 2:39


GoogleCodeExporter commented 9 years ago

I think you misunderstood me. Again taking the WIFF example, if you read a
spectrum from a WIFF file into ProteinScape, do you keep track of the
sample/period/cycle/experiment, or do you lose that information by putting it
in the database?

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 2:51


GoogleCodeExporter commented 9 years ago

In the future there might hopefully be no "original" file, because all
instruments will produce mzML directly (in which case we need an "mzML native
ID" term).

Original comment by eisena...@googlemail.com on 17 Aug 2009 at 2:54


GoogleCodeExporter commented 9 years ago

That's a very good point. I don't recall that we've addressed the issue of
synthetic spectra in the mzML group yet. They may either be synthesized by the
acquisition software or by a spectral decoy/theoretical generation program. If
the mzML is the original file, then it seems to me that
"spectrum=xsd:nonNegativeInteger" would be all that was necessary, because
concepts like period, cycle, controller, function, etc. are all abstracted
away.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 3:01


GoogleCodeExporter commented 9 years ago

From my experience with ProteinScape, tracking that information cannot be
guaranteed at the moment.
For MGFs for example you sometimes (depending on the tool the MGF was created
with) have something like "Cmpd 315, +MSn(931.5), 46.2 min", i.e. rather
unstructured information.
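
To show how fragile recovery from such a title is, here is a hedged sketch that parses exactly the quoted shape with a regular expression. Reading 931.5 as a precursor m/z and 46.2 as a retention time in minutes is an assumption made for illustration, as are the function and field names; titles written by other tools would simply not match.

```python
import re

# Matches only titles shaped like: Cmpd 315, +MSn(931.5), 46.2 min
TITLE_PATTERN = re.compile(
    r"Cmpd (?P<compound>\d+), \+MSn\((?P<mz>\d+(?:\.\d+)?)\), (?P<rt>\d+(?:\.\d+)?) min"
)


def parse_cmpd_title(title):
    """Return the numeric fields of a 'Cmpd ...' title, or None if it does not match."""
    match = TITLE_PATTERN.fullmatch(title.strip())
    if match is None:
        return None
    return {
        "compound": int(match["compound"]),
        "precursor_mz": float(match["mz"]),  # assumed meaning of the MSn value
        "rt_minutes": float(match["rt"]),    # assumed meaning of the "min" value
    }


fields = parse_cmpd_title("Cmpd 315, +MSn(931.5), 46.2 min")
```

Any importer relying on this would need one pattern per peak list tool, which is the practical argument for a structured identifier.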

BTW: I agree, that the info SHOULD be there (but in fact isn't).

Original comment by eisena...@googlemail.com on 17 Aug 2009 at 3:02


GoogleCodeExporter commented 9 years ago

Comment 19 isn't strictly true is it? If we have merged spectra, then the
format in the mzML is:
<spectrum id="merged=1">
with no reference to sample, period, cycle and expt? And it could be merged from
multiple files? This is very mzML specific, so I really think we should be
referring back to the mzML file and calling this an mzML id.
(btw, I couldn't see anything in the mzML doco about merged spectra).

Original comment by dcre...@gmail.com on 17 Aug 2009 at 3:12


GoogleCodeExporter commented 9 years ago

Merged spectra should (must?) have the spectra that went into the merge in the
scanList, and the WIFF nativeIDs would be there instead. If it's not documented,
we'll have to fix that. The merged case is an exception that had to be made in
an otherwise quite robust system. The fun really begins when you look at
including UV and MS spectra from Bruker instruments in the same file, where the
UV and MS spectra have different nativeID formats.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 3:16


GoogleCodeExporter commented 9 years ago

Exactly. And we aren't carrying the wiff native IDs from the scan list from the
mzML through to the mzIdentML file, so the _only_ thing that we can do safely
is to have a new CV term for spectrumIDFormat which specifies that it is an
mzML identifier. And we are back to the choices I listed in comment 14. Any
preference Matt? (Or do you still disagree?)

Original comment by dcre...@gmail.com on 17 Aug 2009 at 4:19


GoogleCodeExporter commented 9 years ago

We seem to have got held up here on an exceptionally minor issue :-)

The value of the spectrumID is the same whichever way it is imported from
mzML; the only difference is how much work there is for an mzIdentML exporter
to work out what format the IDs are.

I can see that it's very easy to say, the input to the search is an mzML
format, so the spectrumID format is mzML nativeID. If it is trivial to examine
an mzML format and work out the required CV term for <spectrumIDFormat> then I
don't really care if we use the same CV term, although to me this implies a
semantics to the identifier that the search engine has no interest in.

If it is sometimes not clear or difficult then I'd rather have a new term for
mzML nativeID.

Either way, a quick decision would be good...

Original comment by andrewro...@googlemail.com on 17 Aug 2009 at 4:21


GoogleCodeExporter commented 9 years ago

"merged=<index>" is a valid id for ANY nativeID format. So it's perfectly safe
to carry the WIFF nativeID for the unmerged scans and carry the merged id for
the merged scans.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 4:22


GoogleCodeExporter commented 9 years ago

As interesting as this is, I've no time to carry on with the discussion.
If you really, really don't want us to add a new cv term, please just say what
we should use for this case and we'll go along with your recommendation:

<sourceFileList count="2">
  <sourceFile id="SF1" name="*" location="file:... ">
    . . .
    <cvParam cvRef="MS" accession="MS:1000769" name="Waters nativeID format" /> 
  </sourceFile>
  <sourceFile id="SF2" name="xxx.wiff" location="file:...">
    . . .
    <cvParam cvRef="MS" accession="MS:1000770" name="WIFF nativeID format" /> 
  </sourceFile>
</sourceFileList>

<spectrum id="merged=1">

Thanks.

Original comment by dcre...@gmail.com on 17 Aug 2009 at 4:55


GoogleCodeExporter commented 9 years ago

I do not see a benefit to adding an mzML nativeID term, and losing an easy way
to identify (programmatically) the id format's defined syntax is a detriment to
me, even at the search engine and protein assembly steps.

As for the case where two source files are contributing to the mzIdentML:
spectrum must have a sourceFileRef to make the id unique no matter what option
we choose! I could just as easily ask you:

<sourceFileList count="2">
  <sourceFile id="SF1" name="foo.RAW" location="file:... ">
    . . .
    <cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" />
  </sourceFile>
  <sourceFile id="SF2" name="bar.RAW" location="file:...">
    . . .
    <cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" />
  </sourceFile>
</sourceFileList>

<spectrum id="controllerType=0 controllerNumber=1 scan=123"> <!-- both files have this id -->

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 5:05


GoogleCodeExporter commented 9 years ago

Sorry David, I misunderstood you at first; do I understand correctly that you
mean that sourceFileList is what's in the mzML? Having different nativeIDs
(like the Bruker BAF/U2 case) does make this nastier, but having different
sourceFiles means it was nasty to begin with, as my last post should
illustrate. You can't simply point to the mzML with an id because that id is
only unique in the mzML when combined with the sourceFileRef. Which leads me to
notice that in the mzML schema we don't constrain the ids with the
sourceFileRef (which can easily be done).
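
The uniqueness point can be illustrated with a small sketch that groups spectrum ids by the source files they occur in. The input is a plain list of (sourceFileRef, id) pairs invented for the example, and the function name is hypothetical; nothing here reads a real mzML file.

```python
from collections import defaultdict


def find_ambiguous_ids(spectra):
    """Return {native id: sorted source file refs} for ids seen in more than one file."""
    files_by_id = defaultdict(set)
    for source_file_ref, native_id in spectra:
        files_by_id[native_id].add(source_file_ref)
    return {
        native_id: sorted(refs)
        for native_id, refs in files_by_id.items()
        if len(refs) > 1
    }


# Hypothetical spectra from two runs that happen to share a scan number.
SPECTRA = [
    ("SF1", "controllerType=0 controllerNumber=1 scan=123"),
    ("SF2", "controllerType=0 controllerNumber=1 scan=123"),
    ("SF1", "controllerType=0 controllerNumber=1 scan=124"),
]

ambiguous = find_ambiguous_ids(SPECTRA)
```

The id alone is ambiguous for scan=123, whereas the (sourceFileRef, id) pairs are all distinct.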

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 5:15


GoogleCodeExporter commented 9 years ago

Oh dear, that's more of a problem... we were relying on this statement in the
mzML documentation being correct:

"External documents may use this identifier together with the mzML filename or
accession to reference a particular spectrum."

I propose that we release mzIdentML 1.0 without support for mzML input files
and try and resolve this in the next version.

Original comment by dcre...@gmail.com on 17 Aug 2009 at 5:16


GoogleCodeExporter commented 9 years ago

Currently there is no vendor scenario where there will be multiple source files
with duplicate ids. Waters files have different files for each function, but
the function number is included in the nativeID. The BAF/U2 case is irrelevant
because AFAIK U2 is just for LC spectra, which mzIdentML doesn't seem intended
for. It is not valid for a user to combine multiple runs together into a single
mzML file, and that's the only case I can think of that would result in
duplicate ids. I don't think you need to say mzIdentML doesn't support mzML.
The only ones you might not be able to support in 1.0 are future sourceFile
combinations, not current vendor conversions.

Original comment by matt.cha...@gmail.com on 17 Aug 2009 at 5:28


GoogleCodeExporter commented 9 years ago

Message from Matt on psidev-ms-dev for further clarification:
Right now the schema does guarantee that spectrum::id is unique. I believe this
won't be a problem in the future because we will always add a unique axis to
the nativeID (like Waters RAW has function=x instead of "process=x scan=x" and
pointing to the sourceFile). And since we currently don't have any cases where
there are different nativeID formats for MS spectra (BAF/U2 is the only case
and that's MS/UV), I think mzIdentML is fine to say it supports mzML.

Original comment by dcre...@gmail.com on 19 Aug 2009 at 12:52


GoogleCodeExporter commented 9 years ago

Final decisions on this, which have been included in version 1.0:
 - Added required <spectrumIDFormat> element. Exactly one CV term required.
 - Added new CV term MS:1001528, name: Mascot query number 
   (for versions of Mascot prior to version 2.3)
 - Added new CV term MS:1001530, name: mzML unique identifier
 - Added new CV term MS:1001526, name: spectrum from database nativeID format
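
The "exactly one CV term" requirement can be checked along these lines. This is a sketch with a hypothetical function name, using the standard xml.etree.ElementTree module and ignoring the XML namespace a real mzIdentML file would declare.

```python
import xml.etree.ElementTree as ET


def check_spectrum_id_format(spectra_data):
    """Return None if <spectrumIDFormat> holds exactly one cvParam, else a message."""
    container = spectra_data.find("spectrumIDFormat")
    if container is None:
        return "missing <spectrumIDFormat>"
    params = container.findall("cvParam")
    if len(params) != 1:
        return f"expected exactly one cvParam, found {len(params)}"
    return None


GOOD = ET.fromstring(
    '<SpectraData id="SD_1"><spectrumIDFormat>'
    '<cvParam accession="MS:1001528" name="Mascot query number" cvRef="PSI-MS" />'
    "</spectrumIDFormat></SpectraData>"
)
MISSING = ET.fromstring('<SpectraData id="SD_2" />')
TWO_TERMS = ET.fromstring(
    '<SpectraData id="SD_3"><spectrumIDFormat>'
    '<cvParam accession="MS:1001528" name="Mascot query number" cvRef="PSI-MS" />'
    '<cvParam accession="MS:1000774" name="multiple peak list nativeID format" cvRef="PSI-MS" />'
    "</spectrumIDFormat></SpectraData>"
)
```

A full validator would also confirm that the accession is a child of MS:1000767, as the mapping rule requires.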

Whilst it's clear from the above discussion that there wasn't 100% agreement
with these decisions, this was the majority view.

Original comment by dcre...@gmail.com on 19 Aug 2009 at 1:10


GoogleCodeExporter commented 9 years ago

Original comment by dcre...@gmail.com on 20 Aug 2009 at 8:45

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Should retention time be obsoleted and replaced by scan start time? I find it 
odd that mzML and mzIdentML use different terms to mean the same thing. For 
mzML we agreed that retention/elution time is a peptide/compound chromatography 
property, not a spectrum property.

Original comment by matt.cha...@gmail.com on 15 Apr 2011 at 10:59


mwalzer / psi-pi

Encoding spectrum identifiers for MGF, PKL and DTA files #53