Closed GoogleCodeExporter closed 9 years ago
It doesn't work like that, because only one CVParam is allowed under
<fileFormat>.
Isn't the spectrumID type CV term better placed as CVParam of
SpectrumIdentificationResult?
(That would also allow a mix of several spectrum file types).
Then only the mapping file would have to be changed ;-)
Original comment by eisena...@googlemail.com
on 13 Aug 2009 at 4:31
Too much bloat to have to repeat it for each spectrum?
If there are multiple files, there need to be multiple <SpectraData> items and
therefore multiple <fileFormat> elements.
I can't see a justification for using different indexes within the same file?
Original comment by dcre...@gmail.com
on 13 Aug 2009 at 5:20
I fully agree reg. the "bloat"!
We could allow more cvParams (maxOccurs="unbounded") under
SpectraData/fileFormat,
that would be a schema change in FuGElight...
Original comment by eisena...@googlemail.com
on 13 Aug 2009 at 6:12
To pick at an old scab, I'll quote the other issue:
> Using a zero based index into the MGF isn't an option for the general purpose
> program that takes a Mascot (.dat) results file and converts it to an
> analysisXML file because it doesn't have the mgf file and doesn't know what
> the offset is.
> A common use case might be that someone has an analysisXML document
> originating from an mgf search and thinks a result looks 'interesting'. They
> then want to go back to the original 'raw' data to look at it. Ideally, this
> should take as few steps as possible. The only safe spectrumID value for the
> Mascot converter is the Mascot query number (this is not what the examples
> use at the moment). So, the user needs the Mascot (.dat) results file to then
> find the title/scan/rtinseconds and from that can determine the scan number
> in the raw data. Seems like a long way round to me and requires that they
> also have the .dat file.
Is it not possible for the DAT converter to access the input MGF and translate
from query number to MGF index automagically? Somebody or something must have
access to the MGF or else an identifier for the spectrum is rather useless.
Or are you more concerned with trying to map back from the mzIdentML to the DAT
query? I can understand in that case that once you've done the automagic
conversion from query # to MGF index, it would be tricky to get back to the
query number. This seems analogous to WIFF->mzXML conversion losing the
critical cycle and experiment numbers which is why we gave mzXML the "scan
number only" nativeID format. I'm not sure it makes sense to do that for
identifiers which don't store the raw spectral data; how would it apply to
other intermediate result formats, e.g. SRF, X! Tandem, SQT, OUT, pepXML? I
don't know about SRF, but all the others suffer the same problem as DAT without
the aforementioned automagic conversion (usually by parsing the
filename/attribute/TITLE a certain way).
Original comment by matt.cha...@gmail.com
on 13 Aug 2009 at 7:28
> Or are you more concerned with trying to map back from the mzIdentML to the
> DAT query?
No, that wasn't something we'd really considered, but I guess it could be
useful, so thanks for mentioning it. The assumption is that you'd need the
mzIdentML file *or* the .DAT file, not both.
> Is it not possible for the DAT converter to access the input MGF and
> translate from query number to MGF index automagically? Somebody or something
> must have access to the MGF or else an identifier for the spectrum is rather
> useless.
Yes, (maybe!) but not in a general way. We want a script that can run on all
Mascot servers (including our public web site) and for searches that were
possibly done years ago. The mgf file is normally on a different computer and
may have been lost.
If there's no longer an MGF file, the index is of course useless but you may
still want an mzIdentML file. For a pkl or merged dta file, there is no
reliable way to get back to generating an index.
In the next version of Mascot, we intend to save the index into the
mgf/pkl/dta file, so it shouldn't be an issue for new searches.
Original comment by dcre...@gmail.com
on 14 Aug 2009 at 8:57
Following the telecon yesterday, we've agreed that rather than adding a "native
spectrum identifier format" CV item to the <fileFormat> element, we should use
a new <spectrumIDFormat> element. For example:
<SpectraData location="file:///est_coding_test.mgf" id="SD_1">
<fileFormat>
<cvParam accession="MS:1001062" name="Mascot MGF file" cvRef="PSI-MS" />
</fileFormat>
<spectrumIDFormat>
<cvParam accession="MS:1000774" name="multiple peak list nativeID format"
cvRef="PSI-MS" />
</spectrumIDFormat>
</SpectraData>
This requires the following addition to the mapping file:
<CvMappingRule id="SpectraDataSpectrumIDFormat_rule"
cvElementPath="/mzIdentML/DataCollection/Inputs/SpectraData/spectrumIDFormat/cvParam/@accession"
requirementLevel="MUST" scopePath="" cvTermsCombinationLogic="OR">
<CvTerm termAccession="MS:1000767" useTermName="false" useTerm="false"
termName="native spectrum identifier format"
isRepeatable="false" allowChildren="true" cvIdentifierRef="MS" />
</CvMappingRule>
And the following additional CV items:
[Term]
id: MS:1001526
name: spectrum from database nativeID format
def: "databasekey=xsd:Long" [PSI:MS]
comment: A unique identifier of a spectrum stored in a database (e.g. a PRIMARY KEY identifier).
is_a: MS:1000767 ! native spectrum identifier format
[Term]
id: MS:1001527
name: Proteinscape spectra
def: "Spectra from Bruker/Protagen Proteinscape database." [PSI:MS]
is_a: MS:1000560 ! mass spectrometer file format
[Term]
id: MS:100xxxxx
name: Mascot query number
def: "index=xsd:nonNegativeInteger" [PSI:MS]
comment: The spectrum (query) number in a Mascot results file, starting from 1.
is_a: MS:1000767 ! native spectrum identifier format
is_a: MS:1001405 ! spectrum identification result details
(Should this be 2 separate CV items?)
and add an is_a to
id: MS:1001114
name: retention time(s)
...
is_a: MS:1001405 ! spectrum identification result details
Original comment by dcre...@gmail.com
on 14 Aug 2009 at 3:54
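The mapping rule in the comment above appears to accept any child of MS:1000767 while rejecting the parent term itself (useTerm="false", allowChildren="true"). A minimal Python sketch of that check follows. It assumes a hand-built `IS_A` table containing only terms quoted in this thread, and the function name `is_allowed_id_format` is hypothetical; this is an illustration, not the PSI validator's implementation.

```python
# Hypothetical is_a table, built only from terms quoted in this thread.
IS_A = {
    "MS:1001526": ["MS:1000767"],  # spectrum from database nativeID format
    "MS:1000768": ["MS:1000767"],  # Thermo nativeID format
    "MS:1001527": ["MS:1000560"],  # Proteinscape spectra (a file format term)
}

ROOT = "MS:1000767"  # native spectrum identifier format


def is_allowed_id_format(accession, is_a=IS_A, root=ROOT):
    """Return True if accession is a proper descendant of root.

    The root term itself is rejected, mirroring useTerm="false".
    """
    seen = set()
    stack = list(is_a.get(accession, []))
    while stack:
        parent = stack.pop()
        if parent == root:
            return True
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return False
```

With this table, a file format term such as MS:1001527 is rejected under <spectrumIDFormat> because none of its ancestors is MS:1000767.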
Can you just reuse "multiple peak list nativeID format" instead of making a new
term?
It's the same index=xsd:nonNegativeInteger.
Also, the database nativeIDs deserve more discussion. Are those databases actual
native sources, like the Oracle database used by ABI's 4000 and 5000 series
TOF-TOFs?
Or are they importing native spectra into the database?
Original comment by matt.cha...@gmail.com
on 14 Aug 2009 at 4:03
To clarify, I think in retrospect some of the vendor nativeIDs that have the
same
"scan=xxx" definition should probably have been made as synonyms of "scan number
only" instead of making separate terms. The same would apply to "index=xxx"
terms.
Original comment by matt.cha...@gmail.com
on 14 Aug 2009 at 4:05
If we use the same "multiple peak list nativeID format", then an importer can't
tell
whether the index is for the query number or an index into the original
mgf/pkl/dta file?
The comment says 'i.e.' and not 'e.g.' so strictly speaking it's just
restricted to
those three file types.
The database term isn't for a 4000/5000 series database. It's intended for
ProteinScape or any 'generic' rdb where spectra are stored. The format for
MS:1001480
is pretty specific to the tof-tofs.
Original comment by dcre...@gmail.com
on 14 Aug 2009 at 4:36
Do these non-native databases really not store the necessary information to get
back
to the native spectrum? That would be surprising. If they do, I think the thing
to do
is use the original nativeID. If not, then some other term should be
appropriate. A
generic "database nativeID" term is silly. For one thing, certainly not all
databases
will use xsd:long as a primary key! The nativeID string would make a sensible
nativeID as well (albeit slower, but still reasonable). So for a database
format that
uses a single integer identifier, maybe mzData's term "spectrum identifier" with
"spectrum=xsd:nonNegativeInteger" would be appropriate.
For the query number, what would the file format term be? It wouldn't seem
right for
it to be MGF if you can't map the query number back to the MGF index. Instead,
perhaps DAT should be a valid term for file format and that could easily
indicate
that the id has been rendered useless for getting back to the raw spectrum
because
it's been disconnected from the nativeID?
Original comment by matt.cha...@gmail.com
on 14 Aug 2009 at 4:50
Someone else (Martin) may have comments about the database part. Maybe it
should be
specific to ProteinScape rather than a generic database. However, I think I
agree
with you because the use case here would typically be an export from
ProteinScape to
someone who doesn't have that particular ProteinScape database, but may have
the peak
list and/or raw data file.
For the query number, I partially agree that the file format for the
<SpectraData>
should be .DAT file.
For an mgf file, there is of course a way to get back to the spectrum in the
raw data
file because, in this case, we save the title, scans, retention time etc. as
separate
cv items in the mzIdentML file. However, as you say, the index itself doesn't
help
you get back in one go.
The typical use case will be that the consumer of the mzIdentML file may also
have
the mgf/pkl/dta file and/or the raw data, but not the .dat file.
One problem with using the .dat file as the filename is that you lose the
information
about the filename of the mgf/pkl/dta file. Maybe that filename just has to be
a user
param? Also, we already have the name of the .dat file a few lines above as:
<DataCollection>
<Inputs>
<SourceFile location="file:///../data/F001350.dat" id="SF_1" >
<fileFormat>
<cvParam accession="MS:1001199" name="Mascot DAT file" cvRef="PSI-MS" />
So, I don't have too strong a feeling either way about what the fileFormat of
the
<SpectraData> should be.
However, the description for "multiple peak list nativeID format" would be
rather
misleading:
Index is the spectrum number in the file, starting from 0
So I still think that there should be a new CV term for the query number even
if we
change the 'i.e.' to 'e.g.'
Original comment by dcre...@gmail.com
on 14 Aug 2009 at 5:59
One more thing to agree on...
If the input file is an mzML file, then the fileFormat is:
MS:1000584, name: mzML file
but should the new <spectrumIDFormat> be:
MS:1000767, name: native spectrum identifier format
or the child term specified in the actual mzML file, for example
MS:1000768, name: Thermo nativeID format
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 9:29
I think we should have a term specifically for mzML where the value is exactly
whatever is contained within the <spectrum id="[ ]" > attribute in mzML.
mzIdentML
converters shouldn't need to worry about where the ID format came from further
back
in the pipeline?
Original comment by andrewro...@googlemail.com
on 17 Aug 2009 at 9:41
I agree. So, we have 3 choices:
1. use MS:1000767 (not especially intuitive)
2. make a new term that is a child of MS:1000767
3. make a new term that is a child of ???
Problem with #2 is that it is a bit recursive and could be confusing to the
mzML people?
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 10:53
The nativeID is transitive: transcoding through mzML does not change the
nativeID
format. You should use the same nativeID format that the input document used
because
that's the only reasonable way you're going to preserve the meaning of "sample=1
period=1 cycle=123 experiment=2".
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 12:45
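To illustrate why a known nativeID format matters to downstream tools, here is a minimal Python sketch that splits an id such as "sample=1 period=1 cycle=123 experiment=2" into its named fields. The function name `parse_native_id` is hypothetical, and the sketch assumes every token is a whitespace-separated key=value pair, which may not hold for every nativeID format.

```python
def parse_native_id(native_id):
    """Split a space-separated key=value nativeID into a dict of strings."""
    fields = {}
    for token in native_id.split():
        key, sep, value = token.partition("=")
        # Reject tokens that are not of the form key=value.
        if not sep or not key or not value:
            raise ValueError(f"not a key=value token: {token!r}")
        fields[key] = value
    return fields


wiff_id = parse_native_id("sample=1 period=1 cycle=123 experiment=2")
```

A consumer that only knows "this is an mzML id" would have to treat the string as opaque, whereas a consumer that knows the WIFF format can read the cycle and experiment numbers directly.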
Following this argument through would mean that we would then need to specify
the
fileFormat as a .wiff file rather than the mzML file which seems crazy. We
aren't
including much (any) information about how the peak list was produced from the
raw
data, so you _have_ to go back to the mzML file for all that good stuff. We
just want
an index into the mzML file. It could be pigs_might_fly_01, pigs_might_fly_02
etc.
for all we care!
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 1:12
No. File format is not transitive, nativeID is. Use the nativeID, we already
had this
discussion. NativeID works for all input file types; "pigs_might_fly_02" used
to work
for mzML but in 1.1 there is only nativeID.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 2:35
In order to be able to delete an intermediate mzML and go straight from
mzIdentML to
a native file, it would be nice to have sourceFile/dataProcessing info
forwarded from
the mzML to the mzIdentML, but that might be overkill.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 2:37
Let me put it another way: the mzML file will have an id that, for a WIFF file,
MUST
look like "sample=x period=x cycle=x experiment=x". What reason is there to
call that
an "mzML" nativeID instead of a WIFF nativeID?
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 2:38
In the MPC use case I cannot use the MGF "ID" because the mgf is "lost", once
the
spectra data is imported into ProteinScape. (Proteinscape sends data to search
engines by creating temporary MGFs not visible to the user).
I agree, that we could use a generic database ID, as the ID itself is not
accessible
to the users. I suggested the "specialised" ID as we have those specialised ID
in the
file case also (Bruker BAF, Bruker YEP).
The main argument now should be, that we need a "release" of the example files.
Original comment by eisena...@googlemail.com
on 17 Aug 2009 at 2:39
I think you misunderstood me. Again taking the WIFF example, if you read a
spectrum
from a WIFF file into ProteinScape, do you keep track of the
sample/period/cycle/experiment, or do you lose that information by putting it
in the
database?
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 2:51
In the future there will hopefully be no "original" file, because all
instruments will produce mzML directly (in which case we need an "mzML native
ID" term).
Original comment by eisena...@googlemail.com
on 17 Aug 2009 at 2:54
That's a very good point. I don't recall that we've addressed the issue of
synthetic
spectra in the mzML group yet. They may either be synthesized by the acquisition
software or by a spectral decoy/theoretical generation program. If the mzML is
the
original file, then it seems to me that "spectrum=xsd:nonNegativeInteger" would
be
all that was necessary, because concepts like period, cycle, controller,
function,
etc. are all abstracted away.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 3:01
From my experience with ProteinScape tracking that information cannot be
guaranteed
at the moment.
For MGFs for example you sometimes (depending on the tool the MGF was created
with)
have something like "Cmpd 315, +MSn(931.5), 46.2 min", i.e. rather unstructured
information.
BTW: I agree, that the info SHOULD be there (but in fact isn't).
Original comment by eisena...@googlemail.com
on 17 Aug 2009 at 3:02
Comment 19 isn't strictly true is it? If we have merged spectra, then the
format in
the mzML is:
<spectrum id="merged=1">
with no reference to sample, period, cycle and expt? And it could be merged from
multiple files? This is very mzML specific, so I really think we should be
referring
back to the mzML file and calling this an mzML id.
(btw, I couldn't see anything in the mzML doco about merged spectra).
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 3:12
Merged spectra should (must?) have the spectra that went into the merge in the
scanList, and the WIFF nativeIDs would be there instead. If it's not documented,
we'll have to fix that. The merged case is an exception that had to be made in
an
otherwise quite robust system. The fun really begins when you look at including
UV
and MS spectra from Bruker instruments in the same file, where the UV and MS
spectra
have different nativeID formats.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 3:16
Exactly. And we aren't carrying the wiff native IDs from the scan list from the
mzML
through to the mzIdentML file, so the _only_ thing that we can do safely is to
have a
new CV term for spectrumIDFormat which specifies that it is an mzML identifier.
And
we are back to the choices I listed in comment 14. Any preference Matt? (Or do
you
still disagree?)
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 4:19
We seem to have got held up here on an exceptionally minor issue :-)
The value of the spectrumID is the same whichever way they are imported from
mzML,
the only difference is how much work there is for an mzIdentML exporter to work
out
what format the IDs are.
I can see that it's very easy to say, the input to the search is an mzML
format, so the spectrumID format is mzML nativeID. If it is trivial to examine
an mzML format and work out the required CV term for <spectrumIDFormat> then I
don't really care if we use the same CV term - although to me this implies a
semantics to the identifier that the search engine has no interest in.
If it is sometimes not clear or difficult then I'd rather have a new term for
mzML
nativeID.
Either way, a quick decision would be good...
Original comment by andrewro...@googlemail.com
on 17 Aug 2009 at 4:21
"merged=<index>" is a valid id for ANY nativeID format. So it's perfectly safe
to
carry the WIFF nativeID for the unmerged scans and carry the merged id for the
merged
scans.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 4:22
As interesting as this is, I've no time to carry on with the discussion.
If you really, really don't want us to add a new cv term, please just say what
we
should use for this case and we'll go along with your recommendation:
<sourceFileList count="2">
<sourceFile id="SF1" name="*" location="file:... ">
. . .
<cvParam cvRef="MS" accession="MS:1000769" name="Waters nativeID format" />
</sourceFile>
<sourceFile id="SF2" name="xxx.wiff" location="file:...">
. . .
<cvParam cvRef="MS" accession="MS:1000770" name="WIFF nativeID format" />
</sourceFile>
</sourceFileList>
<spectrum id="merged=1">
Thanks.
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 4:55
I do not see a benefit to adding an mzML nativeID term, and losing an easy way
to
identify (programmatically) the id format's defined syntax is a detriment to
me, even
at the search engine and protein assembly steps.
As for the case where two source files are contributing to the mzIdentML:
spectrum
must have a sourceFileRef to make the id unique no matter what option we
choose! I
could just as easily ask you:
<sourceFileList count="2">
<sourceFile id="SF1" name="foo.RAW" location="file:... ">
. . .
<cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" />
</sourceFile>
<sourceFile id="SF2" name="bar.RAW" location="file:...">
. . .
<cvParam cvRef="MS" accession="MS:1000768" name="Thermo nativeID format" />
</sourceFile>
</sourceFileList>
<spectrum id="controllerType=0 controllerNumber=1 scan=123"> <!-- both files have this id -->
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 5:05
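The point that an id is only unique in combination with its sourceFileRef can be sketched in Python as a composite key. This is a minimal illustration under assumptions: the `index_spectra` helper and the tuple layout are hypothetical, and the payload strings stand in for real spectrum data.

```python
def index_spectra(spectra):
    """Index spectra by (source_file_ref, native_id); raise on a real clash."""
    index = {}
    for source_file_ref, native_id, payload in spectra:
        key = (source_file_ref, native_id)
        if key in index:
            raise ValueError(f"duplicate spectrum key: {key}")
        index[key] = payload
    return index


# Two source files that both contain the same nativeID string.
spectra = [
    ("SF1", "controllerType=0 controllerNumber=1 scan=123", "spectrum from foo.RAW"),
    ("SF2", "controllerType=0 controllerNumber=1 scan=123", "spectrum from bar.RAW"),
]
index = index_spectra(spectra)
```

Keying on the nativeID string alone would silently collapse the two entries into one, which is the ambiguity described in the comment above.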
Sorry David, I misunderstood you at first; do I understand correctly that you
mean
that sourceFileList is what's in the mzML? Having different nativeIDs (like the
Bruker BAF/U2 case) does make this nastier, but having different sourceFiles
means it
was nasty to begin with - as my last post should illustrate. You can't simply
point
to the mzML with an id because that id is only unique in the mzML when combined
with
the sourceFileRef. Which leads me to notice that in the mzML schema we don't
constrain the ids with the sourceFileRef (which can easily be done).
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 5:15
Oh dear, that's more of a problem... we were relying on this statement in the
mzML
documentation being correct:
"External documents may use this identifier together with the mzML filename or
accession to reference a particular spectrum."
I propose that we release mzIdentML 1.0 without support for mzML input files
and try
and resolve this in the next version.
Original comment by dcre...@gmail.com
on 17 Aug 2009 at 5:16
Currently there is no vendor scenario where there will be multiple source files
with duplicate ids. Waters files have different files for each function, but
the function number is included in the nativeID. The BAF/U2 case is irrelevant
because AFAIK U2 is just for LC spectra, which mzIdentML doesn't seem intended
for. It is not valid for a user to combine multiple runs together into a single
mzML file, and that's the only case I can think of that would result in
duplicate ids. I don't think you need to say mzIdentML doesn't support mzML.
The only ones you might not be able to support in 1.0 are future sourceFile
combinations, not current vendor conversions.
Original comment by matt.cha...@gmail.com
on 17 Aug 2009 at 5:28
Message from Matt on psidev-ms-dev for further clarification:
Right now the schema does guarantee that spectrum::id is unique. I believe this
won't
be a problem in the future because we will always add a unique axis to the
nativeID
(like Waters RAW has function=x instead of "process=x scan=x" and pointing to
the
sourceFile). And since we currently don't have any cases where there are
different
nativeID formats for MS spectra (BAF/U2 is the only case and that's MS/UV), I
think
mzIdentML is fine to say it supports mzML.
Original comment by dcre...@gmail.com
on 19 Aug 2009 at 12:52
Final decisions on this, which have been included in version 1.0:
- Added required <spectrumIDFormat> element. Exactly one CV term required.
- Added new CV term MS:1001528, name: Mascot query number
(for versions of Mascot prior to version 2.3)
- Added new CV term MS:1001530, name: mzML unique identifier
- Added new CV term MS:1001526, name: spectrum from database nativeID format
Whilst it's clear from the above discussion that there wasn't 100% agreement
with
these decisions, this was the majority view.
Original comment by dcre...@gmail.com
on 19 Aug 2009 at 1:10
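The first decision (a required <spectrumIDFormat> carrying exactly one CV term) could be checked by a consumer roughly as follows. This is a minimal Python sketch, assuming the un-namespaced <SpectraData> snippet posted earlier in this thread; real mzIdentML files declare an XML namespace, which this sketch ignores. The function name `spectrum_id_format` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example element from earlier in this thread, without a namespace.
SNIPPET = """
<SpectraData location="file:///est_coding_test.mgf" id="SD_1">
  <fileFormat>
    <cvParam accession="MS:1001062" name="Mascot MGF file" cvRef="PSI-MS" />
  </fileFormat>
  <spectrumIDFormat>
    <cvParam accession="MS:1000774" name="multiple peak list nativeID format"
             cvRef="PSI-MS" />
  </spectrumIDFormat>
</SpectraData>
"""


def spectrum_id_format(spectra_data):
    """Return the single spectrumIDFormat accession, or raise ValueError."""
    formats = spectra_data.findall("spectrumIDFormat")
    if len(formats) != 1:
        raise ValueError("expected exactly one <spectrumIDFormat>")
    params = formats[0].findall("cvParam")
    if len(params) != 1:
        raise ValueError("expected exactly one <cvParam> in <spectrumIDFormat>")
    return params[0].get("accession")


accession = spectrum_id_format(ET.fromstring(SNIPPET))
```

An importer could then branch on the returned accession, for example treating MS:1001528 as a Mascot query number rather than as an index into the peak list file.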
Original comment by dcre...@gmail.com
on 20 Aug 2009 at 8:45
Should retention time be obsoleted and replaced by scan start time? I find it
odd that mzML and mzIdentML use different terms to mean the same thing. For
mzML we agreed that retention/elution time is a peptide/compound chromatography
property, not a spectrum property.
Original comment by matt.cha...@gmail.com
on 15 Apr 2011 at 10:59
Original issue reported on code.google.com by
dcre...@gmail.com
on 13 Aug 2009 at 12:07