mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Specifying the Search Database(s) Unambiguously - MIRIAM? #31

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This post concerns the identification of the search database in analysisXML.

Search sequence databases will include:

* complete public databases
* private databases
* custom built sequence databases (which may use custom protein accessions,
or may use accessions from an external source database from which they are
built)
* reverse / shuffled databases e.g. for quality assessment

Any given database is likely to be versioned, or to come in subsets (for
example, species-specific sets).  A good example of this is the International
Protein Index (IPI), which has both species-specific releases and version
numbers.

As a consequence, annotation of the search database can become very
inconsistent.  For example, the PRIDE XML format only provides a 'search
database name' and a 'search database version' field to identify the search
database.  Before cleanup, IPI for example was described in a multitude of
different ways in PRIDE (see the list at the end of this post).

Proposed solution:

1. Use the MIRIAM resource (http://www.biomedcentral.com/1752-0509/1/58/)
as recommended at the PSI Spring meeting, to allow robust identification of
databases.  MIRIAM provides stable accessions for database resources to
unambiguously identify them.  The caveat is that MIRIAM only includes
resources that meet the following criteria:

- be entirely open
- have structured and perennial identifiers
- be programmatically accessible
- be reasonably maintained

I am currently enquiring whether or not MIRIAM describes how to specify
version numbers for versioned databases.

2. Define an XML schema structure that will allow the identification of
zero to many databases, with each database reference including:

(a) the use of CV (i.e. the MIRIAM stable identifier) OR free text for
custom / private databases to identify the database name.
(b) the use of CV to identify species-specificity (using NCBI taxonomy for
example),
(c) an optional version number
(d) an optional URI for the search database
(e) the use of CV to indicate if a database is reversed / shuffled etc.
(f) A unique key / ID for the database entry so that it may be referenced
from elsewhere in the analysisXML file.
(g) an optional release date
(h) optional number of sequences
(i) optional number of sequences searched
(j) file format?  optional or mandatory?
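
As an illustration only, items (a) to (j) can be pictured as a single record. The sketch below is in Python rather than XML schema; the class name, field names and the "exactly one of accession or free text" check are assumptions made for the example and are not part of analysisXML.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SearchDatabaseRef:
    """Hypothetical record mirroring items (a)-(j); not an analysisXML type."""

    entry_id: str                                  # (f) unique key within the file
    name_accession: Optional[str] = None           # (a) CV accession, e.g. a MIRIAM id
    name_free_text: Optional[str] = None           # (a) free text for custom/private DBs
    taxonomy_id: Optional[str] = None              # (b) e.g. an NCBI taxonomy id
    version: Optional[str] = None                  # (c)
    uri: Optional[str] = None                      # (d)
    db_type_accession: Optional[str] = None        # (e) forward / reversed / shuffled
    release_date: Optional[date] = None            # (g)
    num_sequences: Optional[int] = None            # (h)
    num_sequences_searched: Optional[int] = None   # (i)
    file_format: Optional[str] = None              # (j) optional or mandatory is undecided

    def __post_init__(self) -> None:
        # Item (a) reads "CV OR free text", so this sketch requires exactly one.
        if (self.name_accession is None) == (self.name_free_text is None):
            raise ValueError("give exactly one of name_accession or name_free_text")
```

For example, `SearchDatabaseRef(entry_id="db1", name_accession="MIR:00000041", taxonomy_id="9606", version="2.31")` would describe human IPI 2.31, while a private database would set `name_free_text` instead.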

The current structure for defining the search database looks like this
(example from
http://code.google.com/p/psi-pi/source/browse/trunk/examples/schema_usecase_examples/working6June/F001350.xml)
(Note, the SearchDatabase element can already appear zero to many times, so
no change needed there):
<SearchDatabase
location="file:///C:/INETPUB/MASCOT/sequence/SwissProt/current/SwissProt_51.6.fasta"
identifier="SwissProt" name="SwissProt" numDatabaseSequences="257964"
numSequencesSearched="257964" releaseDate="SwissProt_51.6.fasta"
version="SwissProt_51.6.fasta">
    <pf:_fileFormat>
        <pf:cvParam accession="" name="" cvRef="" value="" />
    </pf:_fileFormat>
    <DatabaseName>
        <pf:cvParam accession="" name="" cvRef="" value="" />
    </DatabaseName>
    <!-- pf:cvParam accession="PI:00019" name="number of residues" cvRef="PSI-PI" value="93947433"/-->
    <!-- pf:cvParam accession="PI:00025" name="database type" cvRef="PSI-PI" value="AA"/-->
</SearchDatabase>

Changed to (using IPI as an example):

<SearchDatabase
location="ftp://ftp.ebi.ac.uk/pub/databases/ipi/blah/ipi-2.31.fasta"
identifier="unique-internal-identifier" numDatabaseSequences="257964"
numSequencesSearched="257964" releaseDate="2008-12-28" version="2.31">
    <DatabaseIdentifier>
        <pf:cvParam accession="MIR:00000041" name="IPI" cvRef="MIRIAM" />
    </DatabaseIdentifier>
    <DatabaseTaxonomy>
        <pf:cvParam accession="9606" name="Homo sapiens (Human)" cvRef="NEWT" />
    </DatabaseTaxonomy>
    <DatabaseType>
        <pf:cvParam accession="PI:012345" name="Forward Database" cvRef="PSI-PI" />
    </DatabaseType>
    <pf:_fileFormat>
        <pf:cvParam accession="SOMETHING:0012345" name="fasta"
cvRef="FILE_FORMAT_ONTOLOGY" />
    </pf:_fileFormat>
</SearchDatabase>

(pf namespace is FuGE Light)
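
To show how a consumer might use the proposed structure, here is a minimal Python sketch that builds the IPI example with the standard-library `xml.etree.ElementTree` and reads back a (database, taxonomy, version) key. The namespace URI is a placeholder, since the real FuGE Light URI is not given in this thread, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder URI: the real FuGE Light namespace is not given in this thread.
PF_NS = "urn:example:fuge-light"
ET.register_namespace("pf", PF_NS)


def add_cv_param(parent, accession, name, cv_ref):
    """Append a pf:cvParam child carrying one controlled-vocabulary term."""
    return ET.SubElement(
        parent,
        f"{{{PF_NS}}}cvParam",
        {"accession": accession, "name": name, "cvRef": cv_ref},
    )


def build_search_database():
    """Build a reduced version of the IPI example element proposed above."""
    db = ET.Element(
        "SearchDatabase",
        {
            "location": "ftp://ftp.ebi.ac.uk/pub/databases/ipi/blah/ipi-2.31.fasta",
            "identifier": "unique-internal-identifier",
            "numDatabaseSequences": "257964",
            "numSequencesSearched": "257964",
            "releaseDate": "2008-12-28",
            "version": "2.31",
        },
    )
    add_cv_param(ET.SubElement(db, "DatabaseIdentifier"), "MIR:00000041", "IPI", "MIRIAM")
    add_cv_param(ET.SubElement(db, "DatabaseTaxonomy"), "9606", "Homo sapiens (Human)", "NEWT")
    return db


def database_key(db):
    """Return (database accession, taxonomy accession, version) for one element."""
    tag = f"{{{PF_NS}}}cvParam"
    ident = db.find(f"DatabaseIdentifier/{tag}")
    taxon = db.find(f"DatabaseTaxonomy/{tag}")
    return (ident.get("accession"), taxon.get("accession"), db.get("version"))


if __name__ == "__main__":
    element = build_search_database()
    print(ET.tostring(element, encoding="unicode"))
    print(database_key(element))
```

The point of the key is that the database identifier alone does not pin down the file that was searched; the taxonomy term and version are needed alongside it.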

Ambiguous names and versions for IPI in PRIDE (prior to manual cleanup):

"IPI",""
"IPI","2.21"
"IPI","20050404"
"IPI","3.05"
"IPI","3.12"
"ipi","3.18.fasta"
"IPI","ipi.human.v3.20.fasta"
"IPI","MOUSE 1.25"
"IPI","MOUSE 1.26"
"IPI human",""
"IPI human","2.31"
"ipi.HUMAN","3.01"
"IPI_Bovine","ipi.BOVIN.v3.22.fasta"
"IPI_human","ipi.HUMAN.v3.19.fasta"
"IPI_human","IPI_Human_3.28.fasta"
"IPI_mouse","IPI_mouse_20070812.fasta"
"IPI_mouse","IPI_mouse_v3.34.fasta"
"IPI_rat","IPI_Rat_3.36.fasta"
"IPI_zebrafish","IPI_Zebra_3.31.fasta"
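
The kind of manual cleanup these pairs need can be sketched in a few lines of Python. This is a heuristic that only covers the patterns in the list above; the species lookup table and the function name are assumptions made for the example, not PRIDE code.

```python
import re

# Hypothetical lookup for the species tokens seen in the list above.
SPECIES_TOKENS = {
    "human": "human",
    "mouse": "mouse",
    "rat": "rat",
    "bovin": "bovine",
    "bovine": "bovine",
    "zebra": "zebrafish",
    "zebrafish": "zebrafish",
}


def normalise(name, version):
    """Heuristically split a (name, version) pair into
    (species, release number, release date); missing parts are None."""
    species = None
    # Tokenise on anything that is not a letter so "rat" only matches as a word.
    for token in re.findall(r"[a-z]+", f"{name} {version}".lower()):
        if token in SPECIES_TOKENS:
            species = SPECIES_TOKENS[token]
            break
    release = re.search(r"\d+\.\d+", version)   # e.g. 3.19
    stamp = re.search(r"\d{8}", version)        # e.g. 20050404
    return (
        species,
        release.group(0) if release else None,
        stamp.group(0) if stamp else None,
    )
```

Even with such a heuristic, entries like `"IPI",""` stay ambiguous, which is the motivation for structured fields in the first place.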

Original issue reported on code.google.com by philip.j...@gmail.com on 18 Jun 2008 at 5:31

GoogleCodeExporter commented 9 years ago
It would be useful to have a way of reporting custom databases, even if it is
just free text describing how the database was constructed.

It would also be beneficial to be able to report the number of peptides in the
search database (in addition to the number of proteins searched).

Original comment by jensie...@gmail.com on 27 Jun 2008 at 10:53

GoogleCodeExporter commented 9 years ago
TeleCon, 10th July 2008: A universal and unambiguous way of specifying the
search database seems a hopeless expectation.  Where a custom database has been
used, it should be submitted along with the analysisXML file.

Original comment by eisena...@googlemail.com on 15 Jul 2008 at 3:28

GoogleCodeExporter commented 9 years ago
To close this issue (TeleCon 11th September, 2008):

- use CV to specify DB name for most important DBs (version etc. is specified
as attribute)

- use UserParam for database names not in the CV and submit them alongside the
data set (or specify their publicly accessible URI in the location attribute)
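
The two rules above amount to a simple fallback, sketched here in Python. The accessions in the lookup table are placeholders, since the real PSI-PI terms are not listed in this thread; the element names are left unprefixed for brevity, and the attribute layout of the userParam is an assumption.

```python
import xml.etree.ElementTree as ET

# Placeholder accessions: the real PSI-PI terms are not listed in this thread.
KNOWN_DATABASES = {
    "SwissProt": "PI:PLACEHOLDER-1",
    "IPI": "PI:PLACEHOLDER-2",
}


def database_name_element(name):
    """Return a DatabaseName element holding a cvParam for databases in the
    CV, or a userParam for anything else (custom or private databases)."""
    holder = ET.Element("DatabaseName")
    accession = KNOWN_DATABASES.get(name)
    if accession is not None:
        ET.SubElement(
            holder,
            "cvParam",
            {"accession": accession, "name": name, "cvRef": "PSI-PI"},
        )
    else:
        # Not in the CV: fall back to free text, as agreed on the TeleCon.
        ET.SubElement(holder, "userParam", {"name": name})
    return holder
```

In the userParam case the database itself would still need to be submitted with the data set, or its public URI given in the location attribute.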

Original comment by eisena...@googlemail.com on 11 Sep 2008 at 4:13

GoogleCodeExporter commented 9 years ago
At the Turku meeting, it was decided to re-open this issue. There is the option
of specifying a MIRIAM id rather than using the CV. However, it would not be
enough for mzIdentML on its own. For example, IPI has a single MIRIAM id, but
the files are available as IPI_Human, IPI_Mouse etc., so a taxID would also
need to be coupled to it.

Original comment by dcre...@gmail.com on 5 May 2009 at 10:46

GoogleCodeExporter commented 9 years ago
E-Mail Comment from Pierre-Alain: 

Hi:
a comment:
MIRIAM does not have an entry for Swiss-Prot, for TrEMBL, for uniparc,
for uniref etc. Only one for Uniprot (which is by the way not the
official abbreviation). Apparently, according to the information Luisa
had, there is no intention to have separate entries for these Uniprot
"sub-"databases. It is therefore not usable so far.

Original comment by delag...@gmail.com on 5 May 2009 at 1:33

GoogleCodeExporter commented 9 years ago
E-Mail comment from David Creasy: 

Hi Pierre-Alain,

Yes, it's not so useful at the moment. Even for IPI it isn't so useful
because there is no entry for MSIPI, and I guess it's unlikely that
there will be looking at the current strategy. Oh, and it also looks as
though the information about IPI is very out of date which doesn't
inspire confidence.

Um... also a bit strange that there is absolutely no mention of anything
from the NCBI.

The agreement at the Turku meeting was to make it an option to use
miriam *or* the existing method. My feeling is now that we
should defer supporting it until it's more comprehensive and a little
more 'neutral', but we can discuss on the next conference call.

Thanks,

David

Original comment by delag...@gmail.com on 5 May 2009 at 1:34

GoogleCodeExporter commented 9 years ago
I agree with David. Defer this until there is at least versioned database
support.

Original comment by delag...@gmail.com on 5 May 2009 at 1:38

GoogleCodeExporter commented 9 years ago
Hi,

This is what I found out after a meeting with the MIRIAM people today:

1. The only requirements for a database to be included in MIRIAM are: to be
freely available, to be accessible online, to have stable and unique
identifiers for its entries, and to have no licence restriction on its use.
2. Apart from that, if a resource is not there yet, it is because no one has
told them about it. They are very happy to add new ones, and it should be a
very straightforward process.
3. I have told them that at present they have UniProt and UniParc, for
instance, but no specific entries for UniProtKB/SwissProt, TrEMBL, UniRef, etc.
They are going to think about this. Again, no one had told them before.
4. Either the MIRIAM identifier (for instance, MIR:00000065 for PRIDE) or the
MIRIAM URN (urn:miriam:pride) could be used to reference the database.
5. For IPI, they do not want to split it by species but, as pointed out
before, adding a new CV term (taxID) would be possible.

I think it is out of scope here, but MIRIAM is being used in MIAME documents
to ensure that the identifiers reported match the corresponding referenced
database.

Original comment by javizca74@gmail.com on 13 May 2009 at 2:01

GoogleCodeExporter commented 9 years ago
Thanks for finding this out. I guess the most commonly used 'database' in the
mass spec world is the NCBI 'nr' database. It would be interesting to know
whether it is not included because of '1' or '2'.

Also, how about MSIPI. http://www.pil.sdu.dk/msipi.htm
Would they include this as a separate database?

Please let us know what they decide about '3'

Thanks,
David

Original comment by dcre...@gmail.com on 13 May 2009 at 3:23

GoogleCodeExporter commented 9 years ago
In the TeleCon on May 20th it was agreed NOT to use MIRIAM, because:

It is not suitable at the moment, since MIRIAM cannot incorporate the NCBI nr
database or differentiate between TrEMBL and SwissProt, for instance.

Original comment by eisena...@googlemail.com on 28 May 2009 at 3:25

GoogleCodeExporter commented 9 years ago

Original comment by eisena...@googlemail.com on 28 May 2009 at 3:26