mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Issues with the CV #42

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
(Collect change requests for the CV here.)

[Term]
id: PI:00216
name: sequest:PeptideRank
def: "The SEQUEST result 'Rank' in out file (peptide)." [ref:ref]
is_a: PI:00153 ! search engine specific score

I think every search engine has the concept of rank: Make this generic?

Original issue reported on code.google.com by dcre...@gmail.com on 8 Aug 2008 at 9:10

GoogleCodeExporter commented 9 years ago
We didn't have any XSD gurus in the mzML group (or they didn't chime in), so the
uniqueness of nativeID is not XSD-derived; it's in the specification docs and the
semantic validators enforce it (actually, nativeIDs are much more than just
unique: their format is strictly defined depending on the source file). I presume
this is the related XSD uniqueness code. What does it mean in plain English? :)
  <xsd:element name="SpectrumIdentificationList"
      type="psi-pi:PSI-PI.analysis.search.SpectrumIdentificationListType"
      abstract="false"
      substitutionGroup="psi-pi:AnalysisResultList">
    <xsd:unique name="PK_COMPOSITE_SpecRef">
      <xsd:selector xpath="./*"/>
      <xsd:field xpath="@spectrumID"/>
      <xsd:field xpath="@SpectraData_ref"/>
    </xsd:unique>
  </xsd:element>

I don't understand the hesitation to use nativeID, which already has the "if mzML
it means this, if mzData it means this, if mzXML it means this, if MGF it means
this, etc." logic defined. That way implementers can use the same nativeID
parsing code for both standards.

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 6:09
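
In plain English, the xsd:unique block quoted above says that among the direct
children of SpectrumIdentificationList, no two may carry the same
(spectrumID, SpectraData_ref) pair. The check can be sketched in Python as
follows. This is an illustration only: it assumes un-namespaced elements, the
helper name and the sample fragment are made up, and children lacking either
attribute are skipped, which is how xsd:unique treats missing fields.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_duplicate_spectrum_refs(list_element):
    """Return (spectrumID, SpectraData_ref) pairs used by more than one child."""
    pairs = []
    for child in list_element:  # selector "./*": direct children only
        spectrum_id = child.get("spectrumID")
        spectra_ref = child.get("SpectraData_ref")
        # xsd:unique does not constrain elements that lack one of the fields
        if spectrum_id is None or spectra_ref is None:
            continue
        pairs.append((spectrum_id, spectra_ref))
    counts = Counter(pairs)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical instance fragment; element and attribute values are illustrative
SAMPLE = """
<SpectrumIdentificationList>
  <SpectrumIdentificationResult spectrumID="scan=1" SpectraData_ref="SD_1"/>
  <SpectrumIdentificationResult spectrumID="scan=1" SpectraData_ref="SD_2"/>
  <SpectrumIdentificationResult spectrumID="scan=1" SpectraData_ref="SD_1"/>
</SpectrumIdentificationList>
"""

duplicates = find_duplicate_spectrum_refs(ET.fromstring(SAMPLE))
print(duplicates)
```

The same spectrumID may therefore appear twice as long as it refers to different
SpectraData entries; only the repeated pair is reported.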

GoogleCodeExporter commented 9 years ago
I meant nativeID instead of spectrumID to facilitate analysisXML output from
non-mzML input. I still agree with scan time (my preference is not duplicating
terms between CVs), and mgf title/scans/rawscans.

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 6:12

GoogleCodeExporter commented 9 years ago
Hi Matt,
The nativeID enforcement in mzML sounds good to me. 

I'm very sorry, but I don't actually understand what you are proposing:
a) To change the name of the attribute from spectrumID -> nativeID
b) For mzML, to reference the nativeID rather than the id
c) Add a nativeID attribute as well as a spectrumID
d) ?

Or some combination of the above! Perhaps it's just getting too late for me ;) 
David

Original comment by dcre...@gmail.com on 28 Nov 2008 at 7:15

GoogleCodeExporter commented 9 years ago
Both a & b, to emphasize the fact that the nativeID is defined no matter what the
format of the source file is. Also, just like mzML, you would define that format
at the top of the file, although it doesn't appear there is an analysisXML
equivalent to "fileContent/fileDescription" in mzML. The nativeID formats are
defined in the mzML CV and the terms map to that top header to define the
nativeID format for every spectrum in the file: see CV terms starting at
MS:1000767 in
http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo

Original comment by matthew....@vanderbilt.edu on 28 Nov 2008 at 7:28

GoogleCodeExporter commented 9 years ago
Um... in discussions with members of the mzML group it's always been agreed that
this is what we will be using. It was last agreed and documented that we use the
id at a teleconference (which Eric also attended) on 2nd October:
http://psidev.info/index.php?q=node/374

As you said above: "Really there's no nativeID for a merged spectrum, so anything
we come up with is a workaround." Am I missing something? Are you suggesting that
search engines cannot rely on the mzML id value and store this in output files?
The mzML schema documentation says, for the id:
<xs:documentation>A unique identifier for this spectrum. It should be expected
that external files may use this identifier together with the mzML filename or
accession to reference a particular spectrum.</xs:documentation>

Do you think that this is incorrect?

Also, to change the term from spectrumID -> nativeID would, I think, be
confusing. The term makes perfect sense in the context of an mzML document, but
for analysisXML it could easily imply something native to the search engine
rather than to one of its input files?

btw, the file formats of all the input files (spectra, fasta, search engine
outputs) are defined in the analysisXML documents (search on <pf:fileFormat>), so
I'm not sure what you mean. (Ah... I see that a couple of the examples seem to be
missing these - hopefully they will get corrected soon. Thanks for pointing this
out.)

David

Original comment by dcre...@gmail.com on 30 Nov 2008 at 7:07

GoogleCodeExporter commented 9 years ago
Yes, I was at that meeting too. :) The one (important, IMO) use case we did not
consider at that time is output of analysisXML without a corresponding mzML
document. In such a case, the mzML arbitrary id does not exist, but the nativeID
does. This fact convinces me that nativeID is a better reference than the
arbitrary id.

The change of attribute name to nativeID is not so critical, but I think the risk
of confusing the spectrumID with the id attribute when it actually points to the
nativeID attribute is worse than the risk of confusing the nativeID attribute
with some property of the search engine. I think the documentation for the
nativeID attribute can easily make it clear what it's supposed to reference,
especially since it's on a spectrum-centric element; you can copy it from the
mzML schema (although I think this documentation could be improved upon):
<xs:documentation>The native identifier for the spectrum, used by the acquisition
software.</xs:documentation>

It's good to know about the header information. The nativeID (or whatever it's
called in analysisXML) format term would go in the spectra input definition as a
CV Param required by the mapping file.

Original comment by matthew....@vanderbilt.edu on 30 Nov 2008 at 7:35

GoogleCodeExporter commented 9 years ago
Do we allow internal nodes of the PSI:PI ontology to show up in an axml instance?

E.g. at
/AnalysisXML/AnalysisProtocolCollection/ProteinDetectionProtocol/AnalysisParams
the mapping file currently specifies that only the child terms of
"PI:00194 - quality estimation with decoy database" are allowed. However, the MPC
example uses exactly this term to say that a quality estimation was done using a
decoy database. What I would like to do is move all the children of PI:00194 up
into "search input details -> quality estimation options" and keep the term as an
indicator that quality estimation was done by decoy database.

To generalize this, I would enforce through semantic validation that only leaves
of the ontology can be used. This makes the ontology kind of flat; however, it
reduces the complexity of interpreting the used terms and is much clearer, I
think.

I haven't checked how many changes are needed, but IMO only smaller changes to
the CV would be necessary. Any comments? Maybe we could discuss it during the
telcon today.

Original comment by a.bertsc...@googlemail.com on 5 Dec 2008 at 9:21
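
The "leaves only" rule proposed above could be checked roughly as follows. This
is a sketch under assumptions, not the semantic validator's actual logic: it reads
only `id:` and `is_a:` lines from [Term] stanzas, the function names are made up,
and PI:99999 is a hypothetical child term invented for the example (PI:00194 is
the term from the discussion).

```python
def parse_obo_parents(obo_text):
    """Map each [Term] id to the list of its is_a parent ids."""
    parents = {}
    current = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest here
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id:"):
            current = line[len("id:"):].strip()
            parents[current] = []
        elif in_term and current and line.startswith("is_a:"):
            # "is_a: PI:00194 ! name" -> "PI:00194"
            parents[current].append(line[len("is_a:"):].split("!")[0].strip())
    return parents


def internal_terms_used(obo_text, used_accessions):
    """Return the used accessions that have at least one child term."""
    parents = parse_obo_parents(obo_text)
    internal = {p for plist in parents.values() for p in plist}
    return [acc for acc in used_accessions if acc in internal]


# PI:00194 comes from the thread; PI:99999 is a made-up child term
SAMPLE_OBO = """
[Term]
id: PI:00194
name: quality estimation with decoy database

[Term]
id: PI:99999
name: hypothetical decoy option
is_a: PI:00194 ! quality estimation with decoy database
"""

violations = internal_terms_used(SAMPLE_OBO, ["PI:00194", "PI:99999"])
print(violations)
```

With such a rule in place, the MPC example's use of PI:00194 itself would be
flagged as long as that term keeps any children.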

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Terms needed for the SpectraST example

output file type is pepXML (this file type is not specific to SpectraST)

and associated peptide/spectra scores
dot : the dot product of two spectra, measuring spectral similarity
dot_bias : a measure of how much of the dot product is dominated by a few peaks 
discriminant score F : spectraST spectrum score
delta: normalised difference between dot product of top hit and runner-up

Original comment by jensie...@gmail.com on 8 Dec 2008 at 4:48

GoogleCodeExporter commented 9 years ago
Suggestion:
At the moment we have some CV terms below "quality estimation with decoy
database" which actually describe the pre-generated search database (like "decoy
DB accession regexp" or "decoy DB derived from ...").

We should move them below "search database details" as branch "decoy DB details"
together with "decoy DB generation algorithm", "decoy DB type" (e.g. "reverse",
"shuffle", "random") and "decoy DB composition" (e.g. "real+virtual" [formerly
known as "forward+reverse"] and "only virtual").

All this is for decoy searches using a pre-generated database.

Decoy approaches with no explicit decoy database (like Mascot Decoy?) need a new
term "quality estimation with implicite decoy" next to "quality estimation with
decoy DB" under "search input details".

Original comment by eisena...@googlemail.com on 10 Dec 2008 at 4:11

GoogleCodeExporter commented 9 years ago
Discussed/Solved in TeleCon 11th December:

Comment 38:
nativeID: change back to just using mzML 'id'? Andy to change doc

Comment 40:
- no Newt.obo: ignored in validator
- 2nd parent for Paragon scores: done during DocProc (Sean)

Comment 41:
- OBOEdit cannot handle UO import and relationship rows
=> use OBOEdit 2 beta
=> try relationship rows without colon

Comment 44:
- <SourceFile> and <pf:fileFormat> elements are not renamed
(URIs may point to a database "location")
- Martin to add CV term "data stored in database"

Comment 57:
- "use child terms only" is different to "use leaf terms only"
(because child can be internal, if it has children itself)
=> no mechanism to prevent
=> leave CV structure very simple
=> name internal nodes so that they are not being used "intuitively"

Comment 60 (decoy terms):
- Martin to do it like he suggested (instead of real+virtual: forward+decoy)

Original comment by eisena...@googlemail.com on 11 Dec 2008 at 5:37

GoogleCodeExporter commented 9 years ago
I think there is a minor issue with the mapping and CV for:

        <CvMappingRule id="R26"
            cvElementPath="/psi-pi:AnalysisXML/psi-pi:DataCollection/psi-pi:Inputs/psi-pi:SourceFile/pf:cvParam/@accession"
            requirementLevel="MAY" scopePath="" cvTermsCombinationLogic="OR">
            <CvTerm termAccession="PI:00186" useTermName="false" useTerm="false"
                termName="source file details" isRepeatable="true" allowChildren="true"
                cvIdentifierRef="PSI-PI" />
        </CvMappingRule>

PI:00186 has the child nodes "source file format" and "source file name", but the
source file format is given in a separate CVParam and mapping, so this mapping
would allow it to be given twice.

[Term]
id: PI:00040
name: source file format
def: "Type of the source file, the AnalysisXML was created from." [PSI:PI]
is_a: PI:00186 ! source file details

I think PI:00040 should just be a child of the root node?

Original comment by andrewro...@googlemail.com on 15 Dec 2008 at 4:41

GoogleCodeExporter commented 9 years ago
add database file format "PEFF"

Original comment by eisena...@googlemail.com on 29 Apr 2009 at 7:10

GoogleCodeExporter commented 9 years ago
In the PI CV there are peptide and protein Global FDRs. 

The location for some of the terms in the schema must change.

-SpectrumIdentificationList (peptide global FDR).
-ProteinIdentificationList (protein global FDR).

Solution: Collapse the two terms into one (global FDR). Add this global FDR to
the example.

Original comment by dcre...@gmail.com on 5 May 2009 at 11:04

GoogleCodeExporter commented 9 years ago
- added PEFF
- deleted MS:1001214 prot:global FDR, changed pep:global FDR (MS:1001364) to
  simply global FDR, changed inheritance structure to fit into both locations

Original comment by a.bertsc...@googlemail.com on 6 May 2009 at 3:12

GoogleCodeExporter commented 9 years ago
TODOs
- Compile a software list (only Mascot and Co. included so far; greylag added).
- Mapping for checksums for databases (terms contained in PSI-MS)
- Database terms? Taxonomy?
Mapping rules missing:
/mzIdentML/DataCollection/Inputs/SourceFile/
/mzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence/
/mzIdentML/DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/

Original comment by a.bertsc...@googlemail.com on 6 May 2009 at 3:14

GoogleCodeExporter commented 9 years ago
"Regular expression" appears under search input details. These should perhaps
come under cleavage agent details and be renamed to "Cleavage agent regular
expression".

"spectrum identification result details" looks fairly similar to "search result
details". It also has some child terms that look a bit strange: "MGF Raw Scans"

"Unknown modification" appears directly beneath "spectrum interpretation". Is
this correct?

"role type" (contact roles) appears under "spectrum interpretation" - these
should probably go under contacts at the top of the CV

"taxonomy nomenclature" - I don't understand the usage of this term

Original comment by andrewro...@googlemail.com on 7 May 2009 at 2:36

GoogleCodeExporter commented 9 years ago
TODOs
- cleanup software structure; which software terms need to be added? (Mail Matt,
  e.g. Waters Software)
- database and Taxonomy (see issue 50); MIRIAM and Co., Mapping and CV
- where to put the term "unknown modification"
- where to put role type
- other cleanups?

Mapping rules missing (no terms allowed at the moment):
/mzIdentML/DataCollection/AnalysisData/SpectrumIdentificationList/SpectrumIdentificationResult/SpectrumIdentificationItem/PeptideEvidence/
/mzIdentML/DataCollection/AnalysisData/ProteinDetectionList/ProteinAmbiguityGroup/

Things already done:
- Mapping for checksums for databases and input (source) files
- Obsoleted some terms which were duplicates of MS terms
- Regexes of cleavage agents are now in the cleavage agent subtree
- ...

Original comment by a.bertsc...@googlemail.com on 20 May 2009 at 12:42

GoogleCodeExporter commented 9 years ago
need CV term under "Analysissoftware" and "Bruker Software": "ProteinExtractor"
(an algorithm for protein determination/assembly integrated into ProteinScape)

Original comment by eisena...@googlemail.com on 27 May 2009 at 7:25

GoogleCodeExporter commented 9 years ago
Terms needed for X!Tandem, Omssa and SpectraST software.

Original comment by jensie...@gmail.com on 27 May 2009 at 8:10

GoogleCodeExporter commented 9 years ago
Terms required for:
  Mascot Parser, Mascot Distiller and Mascot Integra (Analysis Software)
  Percolator (Analysis software)

 and the following three scores:
    percolator:Q value
    percolator:score
    percolator:PEP  (Posterior error probability)
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

Original comment by margaret...@googlemail.com on 28 May 2009 at 11:28

GoogleCodeExporter commented 9 years ago
need new CV terms for <Threshold>, at least
"no threshold"

Original comment by eisena...@googlemail.com on 28 May 2009 at 3:12

GoogleCodeExporter commented 9 years ago
I remember the group having agreed to replace "DB composition forward+decoy" with
"DB composition target+decoy" in a TeleCon or in Turku, but I cannot find it in
the minutes.

The argument was that we have not only the decoy type "reverse", but others...

Original comment by eisena...@googlemail.com on 28 May 2009 at 3:23

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
The term "MS:1001447 - prot:FDR threshold" as CVParam within <Threshold> should
allow a value, because I want to specify 5 percent.

Or do we want to give here only the TYPE of threshold and the actual value within
the ANALYSISPARAMS section?

Original comment by eisena...@googlemail.com on 28 May 2009 at 3:45

GoogleCodeExporter commented 9 years ago
The MPC example does not validate, because it uses "prot:FDR threshold",
but "prot:global FDR" is allowed in the mapping file.

I remember, that "prot:global FDR" was created as a result
(being a child of "spectrum identification result details")
and "prot:FDR threshold" as search input, because it is used
as input in the Threshold element (being part of "Protocol").

Same for "pep:FDR threshold" and "pep:global FDR".

Original comment by eisena...@googlemail.com on 17 Jun 2009 at 12:46

GoogleCodeExporter commented 9 years ago
We need to be able to record that users have manually accepted or rejected
proteins and peptides. Could I use the CVParam 'Quality estimation by manual
estimation' to do this (at both the protein and peptide level)? Is there a CVTerm
to record user comments about why a protein or peptide was accepted or rejected?

Original comment by patri...@matrixscience.com on 17 Jun 2009 at 4:19

GoogleCodeExporter commented 9 years ago
In response to comment 77, I think you should use:

id: MS:1001125
name: manual validation

I'm not sure about user comments. Currently, values are not allowed for this CV
term, but we could allow a string value to be populated with comments. Any
opinions?

Original comment by andrewro...@googlemail.com on 17 Jun 2009 at 4:34

GoogleCodeExporter commented 9 years ago
A string would be minimally ok, but I would suggest a kind of binary value:
true/false or accepted/rejected

Original comment by pierreal...@gmail.com on 17 Jun 2009 at 7:22

GoogleCodeExporter commented 9 years ago
I think that the passThreshold="true" or passThreshold="false" attributes should
still apply? It's just that it is the user's own threshold in this case. In which
case, adding a string value to the comments would be fine. Um... looks like it
can take a string already:

[Term]
id: MS:1001125
name: manual validation
def: "Result of quality estimation: decision of a manual validation." [PSI:PI]
xref: value-type:xsd\:string "The allowed value-type for this CV term."

Perhaps we just need to add this to one of the examples?

Original comment by dcre...@gmail.com on 17 Jun 2009 at 7:52

GoogleCodeExporter commented 9 years ago
We seem to have two different ways of specifying taxonomy:
  <SequenceCollection>
    <DBSequence id="DBSeq_HSP7D_MANSE" length="652"
        SearchDatabase_ref="SDB_SwissProt" accession="HSP7D_MANSE">
      <seq>MAKAPAVGIDLGTTYSCVGVFQHGKVEIIANDQGNRTTPSYVAFTDTDRLIGDAAKNQVAMNP...</seq>
      <cvParam accession="MS:1001088" name="protein description" cvRef="PSI-MS"
          value="Heat shock 70 kDa protein cognate... - Manduca sexta ..." />
      <cvParam accession="MS:1001469" name="taxonomy: scientific name" cvRef="PSI-MS"
          value="Manduca sexta"/>
      <cvParam accession="MS:1001467" name="taxonomy: NCBI TaxID" cvRef="PSI-MS"
          value="7130"/>
    </DBSequence>

and
    <cv id="NCBI-TAXONOMY" fullName="NCBI-TAXONOMY"
        URI="ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"></cv>
 . . .
      <DatabaseFilters>
        <Filter>
          <FilterType>
            <cvParam accession="MS:1001020" name="DB filter taxonomy" cvRef="PSI-MS" />
          </FilterType>
          <Include>
            <cvParam accession="NCBI:33208" name="Metazoa" cvRef="NCBI-TAXONOMY" />
          </Include>
        </Filter>
      </DatabaseFilters>

Obviously we should be consistent and get rid of one of the methods.

The other CV that could be used in the first example is:

id: MS:1001470
name: taxonomy: Swiss-Prot ID

id: MS:1001468
name: taxonomy: common name 

I suggest that we ditch the second method and allow:

MS:1001467 - taxonomy: NCBI TaxID      
MS:1001468 - taxonomy: common name
MS:1001469 - taxonomy: scientific name
MS:1001470 - taxonomy: Swiss-Prot ID

to be included in DatabaseFilters/Filter/Include.

For MS:1001468, MS:1001469, MS:1001470, I would like to see something like this
added to the def:
 Recommend using MS:1001467 where possible

For MS:1001467, the type should be an unsigned 32 bit integer

Original comment by dcre...@gmail.com on 23 Jun 2009 at 10:43
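
The "unsigned 32 bit integer" constraint suggested for MS:1001467 amounts to a
simple range check on the cvParam value. A minimal sketch, with a hypothetical
helper name and assuming the value arrives as the string found in the `value`
attribute:

```python
import re

UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def parse_ncbi_taxid(value):
    """Parse a TaxID string, rejecting anything outside 0..4294967295."""
    text = value.strip()
    # Only plain ASCII digits are accepted: no sign, no decimal point
    if re.fullmatch(r"[0-9]+", text) is None:
        raise ValueError(f"not an unsigned integer: {value!r}")
    number = int(text)
    if number > UINT32_MAX:
        raise ValueError(f"does not fit in 32 bits: {value!r}")
    return number


taxid = parse_ncbi_taxid("7130")  # Manduca sexta, as in the example above
print(taxid)
```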

GoogleCodeExporter commented 9 years ago
Should the FDR threshold terms be allowed to have units? 

i.e. "MS:1001447 - prot:FDR threshold" and the respective pep term. 

per cent?
parts per notation?

Original comment by a.bertsc...@googlemail.com on 23 Jun 2009 at 3:48

GoogleCodeExporter commented 9 years ago
We use percent, but both should be allowed

Original comment by eisena...@googlemail.com on 24 Jun 2009 at 8:28

GoogleCodeExporter commented 9 years ago
Decided to allow the following two terms only:

[Term]
id: UO:0000186
name: dimensionless unit
def: "A derived unit which is a standard measure of physical quantity consisting of only a numerical number without any units." [Wikipedia:Wikipedia "http://www.wikipedia.org/"]
is_a: UO:0000046 ! derived unit

[Term]
id: UO:0000187
name: percent
def: "A dimensionless ratio unit which denotes numbers as fractions of 100." [Wikipedia:Wikipedia "http://www.wikipedia.org/"]
synonym: "%" EXACT []
is_a: UO:0000190 ! ratio

we need examples of these i.e. "5" "percent"  and "0.05" "dimensionless unit"

Original comment by andrewro...@googlemail.com on 25 Jun 2009 at 3:50
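
With both units allowed, a consumer has to normalise the threshold before
comparing it with anything. A minimal sketch of that step, using the two UO
accessions quoted above; the function name is hypothetical and the 0..1 range
check is an assumption about what counts as a sensible FDR:

```python
UO_DIMENSIONLESS = "UO:0000186"  # dimensionless unit
UO_PERCENT = "UO:0000187"        # percent


def fdr_threshold_as_fraction(value, unit_accession):
    """Convert an FDR threshold given in either allowed unit to a fraction."""
    number = float(value)
    if unit_accession == UO_PERCENT:
        fraction = number / 100.0
    elif unit_accession == UO_DIMENSIONLESS:
        fraction = number
    else:
        raise ValueError(f"unit not allowed for FDR threshold: {unit_accession}")
    # Assumption: an FDR outside 0..1 indicates a wrong unit or a typo
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"FDR threshold out of range: {value} ({unit_accession})")
    return fraction


# The two example encodings asked for above describe the same threshold
from_percent = fdr_threshold_as_fraction("5", UO_PERCENT)
from_fraction = fdr_threshold_as_fraction("0.05", UO_DIMENSIONLESS)
print(from_percent, from_fraction)
```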

GoogleCodeExporter commented 9 years ago
Fixing the mapping for taxonomy: a similar mapping is required for 3 parts of the
schema:

- Sample (previously called GenericMaterial)
- DBSequence
- Filter Include/Exclude

Filter currently maps to "database filtering", which has a child term: DB filter
taxonomy. This term should be deprecated.

I have added a mapping for all three to the child terms of MS:1001089 "molecule
taxonomy".

I have also added a mapping from FilterType to the exact term (no child terms)
"molecule taxonomy".

"molecule taxonomy" is currently a child of "database sequence details", so we
should move it up the hierarchy so it covers samples as well.

Andreas, can you give me a mail offline to check through these changes?

Original comment by andrewro...@googlemail.com on 26 Jun 2009 at 10:32

GoogleCodeExporter commented 9 years ago
When correcting the MPC example it became obvious that some CV terms are missing:

[Term]
id: MS:1001XXX
name: ProteinScape:SearchResultId
def: "The SearchResultId of this peptide as SearchResult in the ProteinScape database."
xref: value-type:xsd\:positiveInteger "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:SearchEventId
def: "The SearchEventId of the SearchEvent in the ProteinScape database."
xref: value-type:xsd\:positiveInteger "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:ProfoundProbability
def: "The Profound probability score stored by ProteinScape."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: Profound:z value
def: "The Profound z value."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: Profound:Cluster
def: "The Profound cluster score."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: Profound:ClusterRank
def: "The Profound cluster rank."
xref: value-type:xsd\:positiveInteger "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: MSFit:Mowse score
def: "The MSFit Mowse score."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: Sonar:Score
def: "The Sonar score."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:PFFSolverExp
def: "The ProteinSolver exp value stored by ProteinScape."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:PFFSolverScore
def: "The ProteinSolver score stored by ProteinScape."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:IntensityCoverage
def: "The intensity coverage of the identified peaks in the spectrum calculated by ProteinScape."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinScape:SequestMetaScore
def: "The Sequest meta score calculated by ProteinScape from the original Sequest scores."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001143 ! search engine specific score for peptides
is_a: MS:1001153 ! search engine specific score

[Term]
id: MS:1001XXX
name: ProteinExtractor:Score
def: "The score calculated by ProteinExtractor."
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001116 ! single protein result details
is_a: MS:1001153 ! search engine specific score

Original comment by eisena...@googlemail.com on 26 Jun 2009 at 1:28

GoogleCodeExporter commented 9 years ago
The validator gives errors for the MPC example, because the value types of some
CV terms are set to 'boolean', although they should have:

type string: 'MS:1001424 - ProteinExtractor:Methodname'
type positiveInteger: 'MS:1001427 - ProteinExtractor:MaxNumberOfProteins'
type double, unit Dalton: 'MS:1001428 - ProteinExtractor:MaxProteinMass'
type positiveInteger: 'MS:1001429 - ProteinExtractor:MinNumberOfPeptides'

Original comment by eisena...@googlemail.com on 26 Jun 2009 at 1:46
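
The validator errors described above come from checking a cvParam value against
the value type declared for its term. A simplified sketch of such a check follows;
the function name and the sample values are hypothetical, and the patterns cover
only the common lexical forms of each XSD type, not the full XSD rules.

```python
import re

# Simplified lexical patterns; the real XSD definitions are somewhat broader
_PATTERNS = {
    "xsd:boolean": re.compile(r"true|false|1|0"),
    "xsd:positiveInteger": re.compile(r"\+?0*[1-9][0-9]*"),
    "xsd:double": re.compile(
        r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN"
    ),
}


def value_matches_type(value, xsd_type):
    """Return True if the string value is lexically valid for the given type."""
    if xsd_type == "xsd:string":
        return True  # any string is acceptable
    pattern = _PATTERNS.get(xsd_type)
    if pattern is None:
        raise ValueError(f"type not covered by this sketch: {xsd_type}")
    return pattern.fullmatch(value.strip()) is not None


# A made-up value for MS:1001427 - ProteinExtractor:MaxNumberOfProteins:
# it fails while the term is typed boolean and passes as positiveInteger
as_boolean = value_matches_type("200", "xsd:boolean")
as_positive_integer = value_matches_type("200", "xsd:positiveInteger")
print(as_boolean, as_positive_integer)
```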

GoogleCodeExporter commented 9 years ago
Implemented changes in CV and mapping file
Addressing comments 81-84,86,87
Semantic Validator is also up-to-date

Original comment by a.bertsc...@googlemail.com on 29 Jun 2009 at 1:54

GoogleCodeExporter commented 9 years ago
Latest Mascot examples give these errors. I think that the mapping needs to be
changed!

Obsolete CV term: 'MS:1001020 - DB filter taxonomy' at element
'/mzIdentML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/DatabaseFilters/Filter/FilterType'
CV term used in invalid element: 'MS:1001316 - mascot:SigThreshold' at element
'/mzIdentML/AnalysisProtocolCollection/ProteinDetectionProtocol/Threshold'
CV term used in invalid element: 'MS:1001316 - mascot:SigThreshold' at element
'/mzIdentML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/Threshold'
Violated mapping rule 'ProteinDetectionProtocolThreshold_rule' at element
'/mzIdentML/AnalysisProtocolCollection/ProteinDetectionProtocol/Threshold'
exactly one of the allowed terms must be used!
Violated mapping rule 'SpectrumIdentificationProtocolThreshold_rule' at element
'/mzIdentML/AnalysisProtocolCollection/SpectrumIdentificationProtocol/Threshold'
exactly one of the allowed terms must be used!

Original comment by dcre...@gmail.com on 30 Jun 2009 at 2:44

GoogleCodeExporter commented 9 years ago
This has been added to the CV for prot:FDR threshold

relationship: has_units UO:0000186 ! dimensionless unit
relationship: has_units UO:0000187 ! percent

The same needs to be added for pep:FDR threshold

Original comment by andrewro...@googlemail.com on 30 Jun 2009 at 2:48

GoogleCodeExporter commented 9 years ago
I have updated the mapping for threshold to include:

"search engine specific input parameter" and changed XOR to OR

Original comment by andrewro...@googlemail.com on 30 Jun 2009 at 4:05
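
The effect of switching the rule from XOR to OR can be sketched as below. This
assumes the usual reading of cvTermsCombinationLogic (XOR: exactly one allowed
term present, matching the "exactly one of the allowed terms must be used!"
message; OR: at least one). The function name is hypothetical, and both
accessions are treated as directly allowed purely for illustration; child-term
expansion is left out.

```python
def rule_satisfied(used_accessions, allowed_accessions, logic):
    """Evaluate a mapping rule's combination logic for one element."""
    matches = sum(1 for acc in used_accessions if acc in allowed_accessions)
    if logic == "OR":
        return matches >= 1   # at least one allowed term is present
    if logic == "XOR":
        return matches == 1   # exactly one allowed term is present
    raise ValueError(f"logic not covered by this sketch: {logic}")


# Illustrative only: a Threshold element carrying both terms from the thread
allowed = {"MS:1001447", "MS:1001316"}  # prot:FDR threshold, mascot:SigThreshold
used = ["MS:1001447", "MS:1001316"]

passes_xor = rule_satisfied(used, allowed, "XOR")
passes_or = rule_satisfied(used, allowed, "OR")
print(passes_xor, passes_or)
```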

GoogleCodeExporter commented 9 years ago
I checked some CV terms (from MS:100100X to approx. MS:1001051) and found some
errors:

Very many CV terms have no "[PSI:PI]" after the definition.
Is it necessary?

MS:1001420, SpectraST:delta:
wrongly spelled: "[:PSI-PI]" instead of "[PSI:PI]"

MS:1001007, sequest:Output lines:
value type missing (xsd:nonNegativeInteger ?)

MS:1001014, database local file path:
value type xsd:string is missing

MS:1001016, database version:
value type xsd:string is missing
There is a "version" attribute in the <SearchDatabase> element.
One of the two is unnecessary.

MS:1001017, database release date:
value type xsd:string is missing
There is a "releaseDate" attribute in the <SearchDatabase> element.
One of the two is unnecessary.

MS:1001019, database filtering:
In the definition, what does "public or private" mean?

MS:1001020, DB filter taxonomy:
def should be changed: "A taxonomy filter was applied to the search database." [PSI:PI]

MS:1001024, translation frame:
value type is missing
In the Mascot NA example, a <DatabaseTranslation> element is used,
using a "frames" attribute. Is the CV term obsolete?

MS:1001028, sequest:SequenceHeaderFilter:
value type missing (xsd:string ?)

MS:1001032, sequest:SequencePartialFilter:
value type missing (xsd:string ?)

MS:1001035, date / time search performed:
value type missing (xsd:dateTime)
The <SpectrumIdentification> element has an "activityDate" attribute.
One of the two is unnecessary.

MS:1001037, sequest:ShowFragmentIons:
value type missing (xsd:boolean)

MS:1001038, sequest:Consensus:
value type missing (xsd:positiveInteger)

MS:1001051, multiple enzyme combination rules:
value type missing (xsd:string ?)
The term is not referenced in the mapping file (nor is cleavage agent details,
its parent).
Additionally the <Enzyme> element has an "independent" attribute.
Can this CV term be deleted?

Original comment by eisena...@googlemail.com on 8 Jul 2009 at 3:50
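
A manual pass like the one above can be partly automated by listing every term
that declares no value type. This is a rough sketch, assuming stanzas are
separated by blank lines and value types are declared as `xref: value-type:...`
lines, as in the terms quoted in this thread. The function name is made up and
the second sample stanza is abbreviated to id and name. Not every term needs a
value type, so the output is only a list of candidates for review.

```python
import re


def terms_without_value_type(obo_text):
    """Return ids of [Term] stanzas that have no value-type xref line."""
    missing = []
    for stanza in re.split(r"\n\s*\n", obo_text.strip()):
        lines = [line.strip() for line in stanza.splitlines()]
        if not lines or lines[0] != "[Term]":
            continue  # skip headers and non-term stanzas
        term_id = next(
            (line[len("id:"):].strip() for line in lines if line.startswith("id:")),
            None,
        )
        has_type = any(line.startswith("xref: value-type:") for line in lines)
        if term_id and not has_type:
            missing.append(term_id)
    return missing


# Raw string because OBO escapes the colon as "\:"
SAMPLE_OBO = r"""
[Term]
id: MS:1001125
name: manual validation
def: "Result of quality estimation: decision of a manual validation." [PSI:PI]
xref: value-type:xsd\:string "The allowed value-type for this CV term."

[Term]
id: MS:1001014
name: database local file path
"""

candidates = terms_without_value_type(SAMPLE_OBO)
print(candidates)
```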

GoogleCodeExporter commented 9 years ago
I sent my comments to Andreas. I am not including minor typos here. I am also
attaching Andreas's comments (### lines).

1) These 2 terms are somehow redundant. The def of MS:1001013 should be changed
(remove the examples, I guess):

[Term]
id: MS:1001011
name: search database details
def: "Details about the database searched." [PSI:PI]
is_a: MS:1001249 ! search input details

[Term]
id: MS:1001013
name: database name
def: "The name of the search database (nr, SwissProt or est_human)." [PSI:PI]
is_a: MS:1001011 ! search database details

##### the 1001013 term is used for mapping to
##### /mzIdentML/DataCollection/Inputs/SearchDatabase/DatabaseName/cvParam/@accession
##### all the child terms of 1001013 are allowed here, but not the term itself
##### the first term is mapped to
##### /mzIdentML/DataCollection/Inputs/SearchDatabase/cvParam/@accession
##### and contains a large collection of terms
##### however, some of the child terms of 1001011 should not be used.
##### maybe we should restructure the CV to only allow terms that make sense?

2) Perhaps the range in the definition should be changed (-3 to +3, excluding
zero, instead of 1-6)?

[Term]
id: MS:1001024
name: translation frame
def: "The translated open reading frames from a nucleotide database considered in the search (range: 1-6)." [PSI:PI]
is_a: MS:1001011 ! search database details

3) Perhaps change the term name “DB filter on sequences” to “DB filter on amino
acid sequence pattern”?

[Term]
id: MS:1001027
name: DB filter on sequences
def: "Filtering applied specifically by amino acid sequence pattern." [PSI:PI]
is_a: MS:1001019 ! database filtering

##### maybe "DB filter on sequence pattern"? The phrase amino acid would restrict
##### it to protein databases

4) Change term name to “quality estimation method details”?

[Term]
id: MS:1001060
name: quality estimation details
def: "Method for quality estimation (manually or wih decoy database)." [PSI:PI]
is_a: MS:1001249 ! search input details

#### accepted and done 

5) Delete this one?

[Term]
id: MS:1001060
name: quality estimation details
def: "Method for quality estimation (manually or wih decoy database)." [PSI:PI]
is_a: MS:1001249 ! search input details

#### same as above?

6) Updated def. I would change the name to “database type nucleotide”:

[Term]
id: MS:1001079
name: database type NA
def: "Database contains nucleic acid sequences." [PSI:PI]
is_a: MS:1001018 ! database type

#### accepted and done

7) I would update the term to “sequence coverage”.

[Term]
id: MS:1001093
name: coverage
def: "The percent coverage for the protein based upon the matched peptide sequences (can be calculated)." [PSI:PI]
xref: value-type:xsd\:decimal "The allowed value-type for this CV term."
is_a: MS:1001116 ! single protein result details

#### accepted and done

8) Update def:

[Term]
id: MS:1001115
name: scan number(s)
def: "Take from mzData. TODO: What does this mean?" [PSI:PI]
is_a: MS:1001105 ! peptide result details

9) Change this term: this is not the name of any database. It could be changed to
“database type EST”. Or, if you are referring to the EST database from NCBI, it should
be called dbEST.

[Term]
id: MS:1001178
name: database EST
is_a: MS:1001013 ! database name

10) Add synonyms to all terms containing “product ion” (fragment ion).

[Term]
id: MS:1001225
name: product ion m/z
def: "The m/z of the product ion." [PSI:PI]
is_a: MS:1001221 ! fragmentation information

[Term]
id: MS:1001226
name: product ion intensity
def: "The intensity of the product ion." [PSI:PI]
is_a: MS:1001221 ! fragmentation information

[Term]
id: MS:1001227
name: product ion m/z error
def: "The product ion m/z error (ADD more docu here)." [PSI:PI]
is_a: MS:1001221 ! fragmentation information

#### accepted and done

11) This term name is wrong. There is no database called EST (dbEST is the one from
the NCBI, for instance).

[Term]
id: MS:1001295
name: decoy DB from EST
is_a: MS:1001284 ! decoy DB derived from

#### see 9) above

12) Are more Mascot-related terms still to be added?

[Term]
id: MS:1001326
name: TODO_add_others
is_a: MS:1001302 ! search engine specific input parameter

13) These terms are redundant:

[Term]
id: MS:1001343
name: NA sequence
def: "The sequence is a nucleic acid sequence." [PSI:PI]
is_a: MS:1001342 ! database sequence details

[Term]
id: MS:1001344
name: AA sequence
def: "The sequence is a amino acid sequence." [PSI:PI]
is_a: MS:1001342 ! database sequence details

These already exist:

[Term]
id: MS:1001073
name: database type AA
def: "Database contains amino acid sequences." [PSI:PI]
is_a: MS:1001018 ! database type

[Term]
id: MS:1001079
name: database type NA
def: "Database contains nucleic acid sequences." [PSI:PI]
is_a: MS:1001018 ! database type

#### right, these are redundant. However, they can be used in different locations
#### I would suggest keeping them

Original comment by javizca74@gmail.com on 9 Jul 2009 at 10:40

GoogleCodeExporter commented 9 years ago
Some ion types are missing from the CV. We have added them to the PRIDE CV in order to
provide fragment ion annotation.

- At the moment there are no -H2O and -NH3 ions for the c, x and z ions.
- The same applies to the precursor ion.
- We have also found that Mascot reports the specific type of immonium ion:

immonium A
immonium C
immonium D
immonium E
immonium F
immonium H
immonium I
immonium K
immonium L
immonium M
immonium N
immonium P
immonium Q
immonium R
immonium S
immonium T
immonium V
immonium W
immonium Y

Original comment by javizca74@gmail.com on 9 Jul 2009 at 10:59

GoogleCodeExporter commented 9 years ago
These are all fairly minor, so they can probably be fixed without discussion.

1. I think this term needs a unit, e.g.
relationship: has_units UO:0000221 ! dalton
relationship: has_units UO:0000222 ! kilodalton

[Term]
id: MS:1001361
name: alternate mass
def: "List of masses a non-standard letter code is replaced with." [PSI:PI]
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001359 ! ambiguous residues

2. I don't see the point of this term:

[Term]
id: MS:1001057
name: tolerance on types
def: "Tolerance on types." [PSI:PI]
is_a: MS:1001055 ! modification parameters

3. Most of the children of quantification have errors (missing data type/unit), but we
can probably leave these for now:

[Term]
id: MS:1001129
name: quantification information
def: "Quantification information." [PSI:PI]
relationship: part_of MS:1001000 ! spectrum interpretation

4. This term needs a datatype (xref: value-type:xsd\:string "The allowed value-type
for this CV term.")

[Term]
id: MS:1001051
name: multiple enzyme combination rules
def: "Description of multiple enzyme digestion protocol, if any." [PSI:PI]
is_a: MS:1001044 ! cleavage agent details

5. This term and related terms need units, e.g.
relationship: has_units UO:0000221 ! dalton
relationship: has_units UO:0000222 ! kilodalton

[Term]
id: MS:1001201
name: DB MW filter maximum
xref: value-type:xsd\:double "The allowed value-type for this CV term."
is_a: MS:1001512 ! Sequence database filters

6. This term needs a URI datatype? (xsd:anyURI)

[Term]
id: MS:1001014
name: database local file path
def: "Local file path of the search database from the search engine's point of view." [PSI:PI]
is_a: MS:1001011 ! search database details

7. Add a date datatype to this (xsd:dateTime)

[Term]
id: MS:1001017
name: database release date
def: "Release date of the search database." [PSI:PI]
is_a: MS:1001011 ! search database details

8. Change string datatype to URI?

[Term]
id: MS:1001015
name: database original uri
def: "URI, from where the search database was originally downloaded." [PSI:PI]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001011 ! search database details

9. Needs a string datatype:

[Term]
id: MS:1001016
name: database version
def: "Version of the search database ." [PSI:PI]
is_a: MS:1001011 ! search database details

10. This needs a datatype (not sure whether it should be an int or a string):

[Term]
id: MS:1001024
name: translation frame
def: "The translated open reading frames from a nucleotide database considered in the search (range: 1-6)." [PSI:PI]
is_a: MS:1001011 ! search database details
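
Several of the points above come down to "this term has no value-type xref". A quick way to list candidates is sketched below in Python. It is a deliberately naive reader that assumes one `tag: value` pair per line and is not a full OBO parser. The function names are hypothetical, and the sample is abbreviated from stanzas quoted in this comment. Grouping terms legitimately have no value type, so the output is only a list of terms to review.

```python
def parse_terms(obo_text):
    """Split OBO text into a list of {tag: [values]} dicts, one per [Term]."""
    terms, current = [], None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {}
            terms.append(current)
        elif line.startswith("["):
            current = None  # some other stanza type, e.g. [Typedef]
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


def missing_value_type(terms):
    """Return ids of terms that have no 'xref: value-type:...' line."""
    return [
        t["id"][0]
        for t in terms
        if "id" in t
        and not any(x.startswith("value-type:") for x in t.get("xref", []))
    ]


SAMPLE = """\
[Term]
id: MS:1001015
name: database original uri
xref: value-type:xsd\\:string "The allowed value-type for this CV term."
is_a: MS:1001011 ! search database details

[Term]
id: MS:1001016
name: database version
is_a: MS:1001011 ! search database details
"""

print(missing_value_type(parse_terms(SAMPLE)))  # prints ['MS:1001016']
```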

Original comment by andrewro...@googlemail.com on 9 Jul 2009 at 1:54

GoogleCodeExporter commented 9 years ago
Once implemented, we need CV terms for some conversion tools (for the SourceFile
element).

Original comment by eisena...@googlemail.com on 16 Nov 2009 at 2:38

GoogleCodeExporter commented 9 years ago

Original comment by eisena...@googlemail.com on 10 Jun 2010 at 8:55