metadata in PRIDE XML and mztab

medbioinf / pia

PIA - Protein Inference Algorithms

https://github.com/medbioinf/pia

metadata in PRIDE XML and mztab #5

Closed ypriverol closed 8 years ago

ypriverol commented 8 years ago

@julianu here is how we decided to export metadata from mzIdentML -> mzTab:

https://github.com/PRIDE-Utilities/ms-data-core-api/blob/master/src/main/java/uk/ac/ebi/pride/utilities/data/exporters/MzIdentMLMzTabConverter.java

and also from PRIDE XML to mzTab:
- https://github.com/PRIDE-Utilities/ms-data-core-api/blob/master/src/main/java/uk/ac/ebi/pride/utilities/data/exporters/PRIDEMzTabConverter.java

We should check how the current version of PIA converts PRIDE XML and mzTab back to mzIdentML, especially the metadata. Some of the information can be redundant, like the software entries, etc.

Can you have a look?

julianu commented 8 years ago

@ypriverol I have a doubt about the validity of the mzTab file provided for testing. As far as I understand the PSM_ID column, the only thing that may differ between lines having the same PSM_ID is the accession. But there are several lines in the current file which have the same PSM_ID and different sequences. Is this valid? I can work around this in the importer, but I could also give a warning about the validity of the actual mzTab file.

ypriverol commented 8 years ago

@julianu in the mzTab specification we don't have anything saying that this has to be unique. I have attached an image here with the table from the PSM example section.

[Image: screen shot 2016-02-22 at 14.13.03]

In the PSM section:

Description: A unique identifier for a PSM within the file. If a PSM can be matched to multiple proteins, the same PSM should be represented on multiple rows with different accessions and the same PSM_ID.

julianu commented 8 years ago

Yes, but that states exactly what I would assume, and it is not what the test file does: only the accessions should differ, nothing else.

Look at the lines with PSM_ID 923 in the test file: there are 6 lines with the five sequences ILSILR, ILSLLR, LISLIR, LISLLR, LLSLLR. One sequence appears twice (with different accessions). I would assume that there should be 6 lines with 5 different PSM_IDs instead.
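
To make the check concrete, here is a minimal sketch (not the PIA importer code) that groups the PSM rows of an mzTab file by PSM_ID and reports every PSM_ID used with more than one sequence. It assumes the usual tab-separated layout with a `PSH` header line and `PSM` data lines; the function name, the accessions and the sample rows are made up for illustration, and which of the five sequences is the repeated one is chosen arbitrarily.

```python
from collections import defaultdict


def find_inconsistent_psm_ids(mztab_text):
    """Return {PSM_ID: sorted sequences} for PSM_IDs used with more than one sequence."""
    header = None
    sequences_by_id = defaultdict(set)
    for line in mztab_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "PSH":
            header = fields  # column names of the PSM section
        elif fields[0] == "PSM" and header is not None:
            record = dict(zip(header, fields))
            sequences_by_id[record["PSM_ID"]].add(record["sequence"])
    return {
        psm_id: sorted(seqs)
        for psm_id, seqs in sequences_by_id.items()
        if len(seqs) > 1
    }


# Made-up rows mirroring the PSM_ID 923 case; accessions are placeholders.
SAMPLE = "\n".join([
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tILSILR\t923\tPROT_A",
    "PSM\tILSLLR\t923\tPROT_B",
    "PSM\tLISLIR\t923\tPROT_C",
    "PSM\tLISLLR\t923\tPROT_D",
    "PSM\tLLSLLR\t923\tPROT_E",
    "PSM\tLLSLLR\t923\tPROT_F",
    # Same sequence, different accessions: the case the specification describes.
    "PSM\tAAGTK\t924\tPROT_A",
    "PSM\tAAGTK\t924\tPROT_B",
])

for psm_id, seqs in find_inconsistent_psm_ids(SAMPLE).items():
    print(f"warning: PSM_ID {psm_id} is used with {len(seqs)} sequences: {seqs}")
```

An importer could either emit such a warning or treat each (PSM_ID, sequence) pair as its own PSM; which of the two is appropriate is exactly the open question here.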

ypriverol commented 8 years ago

@julianu I will check the example; perhaps it is from one of our old exporters. I will check with a new file and let you know.

ypriverol commented 8 years ago

@julianu I found the lines. I will check the issue here and the rationale behind this. Can you check the PRIDE XML import and the metadata?

julianu commented 8 years ago

Yes, I will also work on the mzTab importer. I just need one additional check for whether PSM_IDs are used the way they are used in the test file.

ypriverol commented 8 years ago

@julianu When we exported from PRIDE XML to mzTab we used the spectrum ID as the ID because, as you know, PRIDE XML removed all the PSMs and kept only one peptide sequence for each spectrum; the peptide is then repeated for each protein.

The problem is basically that the I/L peptides will reference the same spectrum. We can try to change our exporter. But the current validator accepts the file, which means we are accepting this case. I don't know what the best way to proceed is.
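
As an illustration of the I/L case: isoleucine and leucine have the same mass, so sequences that differ only in I versus L cannot be told apart by mass alone. The sketch below (helper names are hypothetical, not part of any exporter) collapses I to L and shows that the five sequences reported for PSM_ID 923 fall into a single group.

```python
from collections import defaultdict


def collapse_isobaric(sequence):
    """Replace isoleucine (I) with leucine (L); the two residues have identical mass."""
    return sequence.replace("I", "L")


def group_by_collapsed(sequences):
    """Group peptide sequences that are identical once I and L are treated as equal."""
    groups = defaultdict(list)
    for seq in sequences:
        groups[collapse_isobaric(seq)].append(seq)
    return dict(groups)


# The five sequences reported for PSM_ID 923 earlier in this thread.
SEQUENCES = ["ILSILR", "ILSLLR", "LISLIR", "LISLLR", "LLSLLR"]

groups = group_by_collapsed(SEQUENCES)
for collapsed, members in groups.items():
    print(collapsed, "<-", members)
```

All five sequences collapse to LLSLLR, which is consistent with them sharing one spectrum ID in the export.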

ypriverol commented 8 years ago

@julianu here is the current example I would like to merge: ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2011/03/PRD000397

ypriverol commented 8 years ago

@julianu any news on this?

julianu commented 8 years ago

The metadata are nicely parsed and merged now, #17