mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Instance document examples: Mascot #13

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The following instance document examples are required for Mascot:

MS, protein database
MS-MS, protein database
MS-MS, EST / nucleic acid database
Mixed MS-MS/PMF
Sequence tag search (May not be supported)
Decoy search
Error tolerant search 

Original issue reported on code.google.com by dcre...@gmail.com on 28 Apr 2008 at 1:54

GoogleCodeExporter commented 9 years ago
Need to fill in parameters for the protein determination:
    <ProteinDeterminationProtocol identifier="PDP1" AnalysisSoftware_ref="mascot_parser">
      <!-- please fill this in -->
    </ProteinDeterminationProtocol>

Original comment by delag...@gmail.com on 19 Jun 2008 at 3:41

GoogleCodeExporter commented 9 years ago
ProteinDetectionHypothesis instances need the Sequence_ref attribute.

In fact, is there a reason why the Sequence_ref is not mandatory?

Original comment by andrewro...@googlemail.com on 20 Jun 2008 at 9:30
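Until the schema makes the attribute mandatory, a small script can flag the offending elements. The following is a minimal sketch, not part of the original thread: it assumes the elements carry an `identifier` attribute (as in the ProteinDeterminationProtocol snippet above), matches element names regardless of namespace, and uses a made-up sample document.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def hypotheses_missing_ref(xml_text, attr="Sequence_ref"):
    """Return identifiers of ProteinDetectionHypothesis elements lacking `attr`."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        if local_name(elem.tag) != "ProteinDetectionHypothesis":
            continue
        if attr not in elem.attrib:
            missing.append(elem.get("identifier", "<no identifier>"))
    return missing


# Hypothetical sample; identifiers and values are illustrative only.
SAMPLE = """
<ProteinDetectionList>
  <ProteinDetectionHypothesis identifier="PDH_1" Sequence_ref="DBSeq_1"/>
  <ProteinDetectionHypothesis identifier="PDH_2"/>
</ProteinDetectionList>
"""

print(hypotheses_missing_ref(SAMPLE))
```

The attribute name is a parameter because later comments in this thread refer to it as DBSequence_ref.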

GoogleCodeExporter commented 9 years ago
Fixed comments 1 and 2 in the latest example (working20June/F001350.xml).
Yes, Sequence_ref should be mandatory.

Original comment by dcre...@gmail.com on 24 Jun 2008 at 2:35

GoogleCodeExporter commented 9 years ago
relating to working27June/F001350.xml

I've split this into three sections: TODO's, Questions and Documentation.  It's by
no means a comprehensive look through and may demonstrate more my unfamiliarity
with the entire schema than anything else, but here goes.

== TODO's (in no particular order) ==
* replace references for <pf:DatabaseReference ...> "TODO: Should be
SearchDatabase_ref" with "SDB_SwissProt" in <SequenceCollection>
* the CVs and ontologies need linking into this document.  Suggestion for this,
where appropriate: http://www.ebi.ac.uk/ontology-lookup/ - there is a web service
for it too.  We'll presumably also need to get the PSI(-PI) CV into some semblance
of order with relevant keys and values.
** some examples: line 35: change accession="TODO" to accession="MOD:01211",
relating to the modification "SMA (K)"
** change "Oxidation (M)" modifications to use accession="MOD:00412"
* filters - both the includes and excludes are the same, i.e., "All entries".
This doesn't make a great deal of sense
* there are protein references in the <PeptideEvidence> tags that are not present
in the file.  I assume this is a size-of-xml-file issue or time issue, but it will
need fixing for a final release if this is to be an example of analysisXML that
people can use
* there are a number of "blank" tags, especially around the file format (line 402)
* in the <ProteinDetectionList>, there is a cvParam that is just full of TODO's,
which doesn't really seem to say anything (line 919).  Is this a required tag?
* <ProteinDetectionHypothesis ... DBSequence_ref="...": the dbseq ref should be
prefixed by "DBSeq_"? line 921
* in the <AnalysisSoftwareList>, is it intentional that two different versions of
Mascot were used to identify and parse?  Is this to demonstrate what is possible
in analysisXML (i.e., flexibility)?
* contact details need updating, including software information for vendors; this
requires the CVs to be sorted as well
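
The two accession changes listed above could be applied mechanically. This is only a sketch under assumptions: it presumes the modification name sits in the `name` attribute of a cvParam element (as in the cvParam line quoted later in this thread), and the `urn:example:pf` namespace and sample document are hypothetical. Only the two name/accession pairs come from this comment.

```python
import xml.etree.ElementTree as ET

# Name -> accession pairs quoted in the TODO list above; extend as needed.
MOD_ACCESSIONS = {
    "SMA (K)": "MOD:01211",
    "Oxidation (M)": "MOD:00412",
}


def fill_mod_accessions(root, table=MOD_ACCESSIONS):
    """Replace accession="TODO" on cvParam elements whose name is in `table`.

    Returns the number of elements updated; other elements are left untouched.
    """
    updated = 0
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "cvParam":
            continue
        accession = table.get(elem.get("name"))
        if accession is not None and elem.get("accession") == "TODO":
            elem.set("accession", accession)
            updated += 1
    return updated


# Hypothetical sample; the namespace URI is a placeholder, not the real one.
SAMPLE = """
<Modifications xmlns:pf="urn:example:pf">
  <pf:cvParam accession="TODO" name="SMA (K)"/>
  <pf:cvParam accession="TODO" name="Oxidation (M)"/>
  <pf:cvParam accession="TODO" name="Some other modification"/>
</Modifications>
"""

root = ET.fromstring(SAMPLE)
print(fill_mod_accessions(root))
```

Names missing from the table keep accession="TODO", so they remain easy to find afterwards.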

== Questions ==
* why is <ModificationParams> not part of <SearchParams>?
* what is the difference between <InputFile> and <SpectraData> in the <Inputs>?
This may need documenting clearly so that software engineers properly understand it.
* <FilterType>: what is this and what does it pertain to?  Isn't it covered by the
use of the ontology/CV in the includes and excludes?

== Documentation ==
* we must document what is meant by SpectrumIdentification and PeptideIdentification
** their meaning is not straightforward or intuitive (at least from an MS/MS
perspective)
* the use of 0 and 1, instead of true and false.  I know programmers use both, but
there are oddities where 0 is true, and other times when 0 is false.  We need to be
clear, if not using the words true and false

Original comment by julian.s...@gmail.com on 3 Jul 2008 at 12:40
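
On the 0/1 versus true/false point: the thread does not say how these attributes are typed, but if they are declared as xsd:boolean (an assumption), the XML Schema datatype already fixes the meaning: "true" and "1" are true, "false" and "0" are false, and nothing else is valid. A minimal sketch of a reader that enforces this; the function name is illustrative:

```python
def parse_xsd_boolean(text):
    """Interpret an attribute value using the xsd:boolean lexical forms.

    Accepts exactly "true", "false", "1" and "0" (case-sensitive, surrounding
    whitespace ignored) and raises ValueError for anything else.
    """
    value = text.strip()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not a valid xsd:boolean value: {text!r}")


for raw in ("1", "0", "true", "false"):
    print(raw, "->", parse_xsd_boolean(raw))
```

Rejecting other spellings such as "True" or "yes" keeps the 0-means-true ambiguity from creeping back in through lenient parsers.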

GoogleCodeExporter commented 9 years ago
Average vs. monoisotopic mass is not specified anywhere.

Original comment by dcre...@gmail.com on 3 Jul 2008 at 2:15

GoogleCodeExporter commented 9 years ago
> * there are protein references in the <PeptideEvidence> tags that are not present
> in the file.  I assume this is a size-of-xml-file issue or time issue, but it will
> need fixing for a final release if this is to be an example of analysisXML people
> can use

Added FK ref to DBSequence[@identifier] from PeptideEvidence[@DBSequence_Ref] to
solve this issue

Original comment by delag...@gmail.com on 3 Jul 2008 at 5:17
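
The same foreign-key relationship can be checked outside the schema, for example when preparing an example file. A minimal sketch, assuming only the two attribute names given in the comment above (`DBSequence[@identifier]` and `PeptideEvidence[@DBSequence_Ref]`); the sample document and its identifiers are made up:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def dangling_sequence_refs(xml_text):
    """Return sorted DBSequence_Ref values that match no DBSequence identifier."""
    root = ET.fromstring(xml_text)
    known = {
        elem.get("identifier")
        for elem in root.iter()
        if local_name(elem.tag) == "DBSequence"
    }
    dangling = set()
    for elem in root.iter():
        if local_name(elem.tag) != "PeptideEvidence":
            continue
        ref = elem.get("DBSequence_Ref")
        # Elements without the attribute are skipped here, not reported.
        if ref is not None and ref not in known:
            dangling.add(ref)
    return sorted(dangling)


# Hypothetical sample; identifiers are illustrative only.
SAMPLE = """
<Doc>
  <DBSequence identifier="DBSeq_A"/>
  <PeptideEvidence DBSequence_Ref="DBSeq_A"/>
  <PeptideEvidence DBSequence_Ref="DBSeq_B"/>
</Doc>
"""

print(dangling_sequence_refs(SAMPLE))
```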

GoogleCodeExporter commented 9 years ago
Notes on comment 4. These have been fixed in working9July/F001350.xml, with the
exception of the following:

== TODO's (in no particular order) ==
* the CVs and ontologies need linking into this document.  Suggestion for this
where appropriate: http://www.ebi.ac.uk/ontology-lookup/ - there is a web service
for it too.  We'll also need to presumably get the PSI(-PI) CV in some semblance
of order with relevant keys and values.
- Still need work on the CV

** some examples: line 35: change accession="TODO" to accession="MOD:01211",
relating to the modification "SMA (K)"
** change "Oxidation (M)" modifications to use accession="MOD:00412"
- Needs a fix to the OBO file and also some more code to determine these

* in the <AnalysisSoftwareList>, is it intentional that two different versions of
Mascot were used to identify and parse?  Is this to demonstrate what is possible
in analysisXML (i.e., flexibility)?
- There are two pieces of software: the search engine and the "protein
inferencing" software, and these need separate references.

* contact details need updating, including software information for vendors; this
requires the CV's to be sorted as well
- Not sure how to do this

== Questions ==
* what is the difference between <InputFile> and <SpectraData> in the <Inputs>?
This may need clearly documenting so that the software engineers properly understand.
- InputFile is, for example, a Sequest .out file or a Mascot .dat file

* <FilterType>: what is this and what does it pertain to?  Isn't it covered by the
use of the ontology/CV in the includes and excludes?
- See the wiki

Original comment by dcre...@gmail.com on 10 Jul 2008 at 2:09

GoogleCodeExporter commented 9 years ago
In the meeting on 25 Sept 2008, I was asked to look at the example docs, so this
is where I started.  Specifically, I was to pay attention to CV-related issues.

The following comments relate to the file examples/Mascot_MSMS_example.axml
(revision 170):

L14:  not sure how the contact role is going to work with cvParams
      - no reference in OBO to software vendor: this presumably
        needs adding in
      - should there be references to the "software details" part of
        the OBO?
L28:  is this a note to self, or something genuine?
L35:  what is the CV param for in reference to the
      Provider/ContactRole?
L40:  two people in the AuditCollection - the first is blank
      (no id)
L47:  ORG_DOC_OWNER: no details recorded
L61-L196: names on DBSequence
L204: accessions for Modifications (req. extra code)
L470: change line to:
      <pf:cvParam accession="PI:00083" name="ms-ms search" cvRef="PSI-PI" value=""/>
L476: accession = PI:00064 ???
L489,494,499: accession needs filling in
L490,495,500,546,547,550,551: unitAccession: no entry in the CV (as far
                              as I can see)
L508: enzymes need adding to the CV
L559-L560: a TODO to be done - the taxonomic filter
L567-L576: Mascot result hits ("search engine specific score") need
           adding to the CV
L587: SearchDatabase - release date and version need altering to
      correct values
L589: Database file format needs filling in
L592: Database name needs filling in
L597: location for SpectraData needs to be supplied
L599: SpectraData file format needs filling in
L614,...,L977: Mascot rank needs adding to the CV and a valid
               accession placing in here
L988,...,L1189: Mascot score needs proper accession value here (PI:00171),
               and the name changing to "mascot:score"

Original comment by julian.s...@gmail.com on 2 Oct 2008 at 1:46
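
Many of the items above amount to "this cvParam has no usable accession", which can be listed automatically. A minimal sketch, not part of the original thread: it treats an accession as a placeholder when it is empty, "TODO", or not of the form PREFIX:digits (the shape of PI:00083 and MOD:01211 seen in this issue; that pattern is an assumption, not a rule taken from the schema). The sample document is made up, apart from the "ms-ms search" line quoted in the comment.

```python
import re
import xml.etree.ElementTree as ET

# Assumed accession shape, e.g. "PI:00083" or "MOD:01211".
ACCESSION_PATTERN = re.compile(r"[A-Za-z][A-Za-z-]*:\d+")


def placeholder_cv_params(xml_text):
    """Return (name, accession) pairs for cvParams without a usable accession."""
    root = ET.fromstring(xml_text)
    problems = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "cvParam":
            continue
        accession = elem.get("accession", "")
        if not ACCESSION_PATTERN.fullmatch(accession):
            problems.append((elem.get("name", ""), accession))
    return problems


# Hypothetical sample; only the first cvParam is quoted from the thread.
SAMPLE = """
<Params>
  <cvParam accession="PI:00083" name="ms-ms search" cvRef="PSI-PI" value=""/>
  <cvParam accession="" name="mascot:score" cvRef="PSI-PI" value=""/>
  <cvParam accession="TODO" name="rank" cvRef="PSI-PI" value=""/>
</Params>
"""

for name, accession in placeholder_cv_params(SAMPLE):
    print(f"{name!r}: accession={accession!r}")
```

This only checks the shape of the accession; confirming that a term actually exists in the CV would need the OBO file itself.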

GoogleCodeExporter commented 9 years ago
Some of the issues in comment #8 have been fixed. The following remain:
L14:  not sure how the contact role is going to work with cvParams
      - no reference in OBO to software vendor: this presumably
        needs adding in
      - should there be references to the "software details" part of
        the OBO?
L35:  what is the CV param for in reference to the
      Provider/ContactRole?
L40:  two people in the AuditCollection - the first is blank
      (no id)
L47:  ORG_DOC_OWNER: no details recorded
L204: accessions for Modifications (req. extra code)
L489,494,499: accession needs filling in
L490,495,500,546,547,550,551: unitAccession: no entry in the CV (as far
                              as I can see)
L508: enzymes need adding to the CV
L559-L560: a TODO to be done - the taxonomic filter
L567-L576: Mascot result hits ("search engine specific score") need
           adding to the CV
L587: SearchDatabase - release date and version need altering to
      correct values
L589: Database file format needs filling in
L592: Database name needs filling in
L597: location for SpectraData needs to be supplied
L614,...,L977: Mascot rank needs adding to the CV and a valid
               accession placing in here

Original comment by dcre...@gmail.com on 9 Oct 2008 at 2:46

GoogleCodeExporter commented 9 years ago

Original comment by andrewro...@googlemail.com on 16 Oct 2008 at 3:58