percolator / percolator

Semi-supervised learning for peptide identification from shotgun proteomics datasets
http://percolator.ms

Sqt2pin: psm grouping #22

Closed mattiat closed 13 years ago

mattiat commented 13 years ago

Current psm grouping (by scan number) seems to cause information from different files to be retained in memory longer than necessary. Consider instead splitting the fragSpectrumScan: scanNumber="[scan #]_[Filename]".

[from Barbara's email, 7th March 2011]

I noticed that the schema groups psms first by scan number. So within the <fragSpectrumScan> tag, there are multiple PSMs all for spectra with the same scan number but with different charge states and from different files. For example (I've removed some of the attributes)

```
... ... ... ... ... ...
```

Seems like this would require that all the files be read in and stored before the pin.xml can be written out. Perhaps percolator requires that the psms be grouped in this way, but if not, maybe it's possible to change the schema slightly so that the pin.xml file is written as the .sqt files are read, so that there's no need to store much in memory.
mattiat commented 13 years ago

Hopefully connected with https://github.com/percolator/percolator/issues#issue/13

percolator commented 13 years ago

I would prefer that you keep scanNumber="[scan #]". So this is just a problem in sqt2pin: percolator can handle multiple fragSpectrumScan elements with the same scanNumber. Right?

On Mon, Mar 14, 2011 at 4:31 PM, mattiat reply@reply.github.com wrote:

TODO: discuss with Lukas

https://github.com/percolator/percolator/issues/22

mattiat commented 13 years ago

Yes, Percolator has no problem handling several fragSpectrumScan elements with the same scanNumber. For each psm in a pin file, the fragSpectrumScan's scanNumber is read and stored in a PSMDescription object. It is not used as an id and CAN be duplicated. To make sure this is the case, I compared Percolator's output on two almost identical pin files, the only difference being that one fragSpectrumScan had been "broken down" (i.e. its psms had been split across multiple fragSpectrumScan elements with identical attributes). The results were identical.
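To illustrate why the split makes no difference to a reader that treats scanNumber as plain data, here is a minimal Python sketch. It is not Percolator's actual parser: the `<experiment>` root element, the namespace-free tags and the `read_psms` helper are assumptions made for the example. It only shows that flattening the PSMs gives the same list whether they sit in one fragSpectrumScan or in several with the same scanNumber.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified pin-like documents (not the real pin.xml schema).
GROUPED = """<experiment>
  <fragSpectrumScan scanNumber="35">
    <peptideSpectrumMatch id="FileName_35_2_1"/>
    <peptideSpectrumMatch id="FileName_35_4_1"/>
    <peptideSpectrumMatch id="DifferentFileName_35_2_1"/>
  </fragSpectrumScan>
</experiment>"""

SPLIT = """<experiment>
  <fragSpectrumScan scanNumber="35">
    <peptideSpectrumMatch id="FileName_35_2_1"/>
    <peptideSpectrumMatch id="FileName_35_4_1"/>
  </fragSpectrumScan>
  <fragSpectrumScan scanNumber="35">
    <peptideSpectrumMatch id="DifferentFileName_35_2_1"/>
  </fragSpectrumScan>
</experiment>"""


def read_psms(xml_text):
    """Return (psm id, scan number) pairs in document order.

    The scan number is copied onto each PSM as an ordinary value;
    it is never used as a key, so duplicates are harmless.
    """
    root = ET.fromstring(xml_text)
    psms = []
    for fss in root.iter("fragSpectrumScan"):
        scan = int(fss.get("scanNumber"))
        for psm in fss.iter("peptideSpectrumMatch"):
            psms.append((psm.get("id"), scan))
    return psms


same = read_psms(GROUPED) == read_psms(SPLIT)
```

In this sketch `same` is true because both documents flatten to the same three (id, scan) pairs.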

mattiat commented 13 years ago

Current behavior of sqt2pin:

```
<fragSpectrumScan experimentalMassToCharge="912.6508" scanNumber="35">
  <peptideSpectrumMatch id="FileName_35_2_1"></...
  <peptideSpectrumMatch id="FileName_35_4_1"></...
  <peptideSpectrumMatch id="DifferentFileName_35_2_1"></...
</fragSpectrumScan>
```

Modify to:

```
<fragSpectrumScan experimentalMassToCharge="912.6508" scanNumber="35">
  <peptideSpectrumMatch id="FileName_35_2_1"></...
  <peptideSpectrumMatch id="FileName_35_4_1"></...
</fragSpectrumScan>

<fragSpectrumScan experimentalMassToCharge="912.6508" scanNumber="35">
  <peptideSpectrumMatch id="DifferentFileName_35_2_1"></...
</fragSpectrumScan>
```
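The proposed grouping can be sketched as a streaming group-by on (file, scan) instead of scan alone. This is a rough Python illustration, not sqt2pin's code: the tuple layout `(file, scan, charge, rank)` and the `group_per_file` helper are hypothetical, and it assumes that PSMs arrive file by file with same-scan records adjacent, so each group can be emitted as soon as it ends.

```python
from itertools import groupby


def group_per_file(psms):
    """Yield (file, scan, [psm ids]) from a stream ordered by file, then scan.

    groupby only merges *consecutive* records with the same key, so nothing
    from an earlier file has to be kept in memory once its group is emitted.
    """
    for (file_name, scan), group in groupby(psms, key=lambda p: (p[0], p[1])):
        ids = [f"{f}_{s}_{charge}_{rank}" for f, s, charge, rank in group]
        yield file_name, scan, ids


# Hypothetical records: (file, scan, charge, rank)
psms = [
    ("FileName", 35, 2, 1),
    ("FileName", 35, 4, 1),
    ("DifferentFileName", 35, 2, 1),
]

groups = list(group_per_file(psms))
```

Here `groups` holds two entries for scan 35, one per file, matching the "Modify to" layout above.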
percolator commented 13 years ago

So all three psms have the same mass. That must be a bug...

Cheers

-Lukas

On Apr 13, 2011 11:36 AM, "mattiat" <reply@reply.github.com> wrote:

> Current behavior of sqt2pin: ... Modify to: ...


mattiat commented 13 years ago

SqtReader::translateSqtFileToXML() takes as a parameter a FragSpectrumScanDatabase in which all psms are stored. FIX: pass a vector vFSS instead, and fill vFSS[i] with the information coming from the i-th line of the metafile. This operation is performed twice (once for the target metafile and once for the decoy metafile). ASSUMPTION: corresponding target and decoy files have the same line number in their respective metafiles.
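A minimal Python sketch of this fix, under stated assumptions: the real code is C++ and its types are not shown here, so `fill_from_metafile`, the `read_file` callback and the list-of-lists standing in for vFSS are all hypothetical. It only illustrates the indexing idea, where line i of the target metafile and line i of the decoy metafile both feed vFSS[i].

```python
def fill_from_metafile(vfss, metafile_lines, read_file, is_decoy):
    """Append the PSMs of the i-th listed file to vfss[i].

    vfss           -- list of per-line containers (stand-in for the vector vFSS)
    metafile_lines -- file paths, one per metafile line
    read_file      -- hypothetical callback returning the PSM ids of one file
    is_decoy       -- flag stored alongside each PSM
    """
    for i, path in enumerate(metafile_lines):
        while len(vfss) <= i:  # grow the vector on the first pass
            vfss.append([])
        for psm_id in read_file(path):
            vfss[i].append((psm_id, is_decoy))


# Hypothetical metafiles; line i of each is assumed to describe the same run.
target_metafile = ["t1.sqt", "t2.sqt"]
decoy_metafile = ["d1.sqt", "d2.sqt"]


def fake_reader(path):
    """Stand-in for parsing an .sqt file: one made-up PSM id per file."""
    return [f"{path}_35_2_1"]


vfss = []
fill_from_metafile(vfss, target_metafile, fake_reader, is_decoy=False)
fill_from_metafile(vfss, decoy_metafile, fake_reader, is_decoy=True)
```

After both passes each vfss[i] holds the target and decoy PSMs of one file pair, so it could be written out and released independently of the others.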

mattiat commented 13 years ago

Commit: https://github.com/percolator/percolator/commit/224ef9931655698fe0b88f8ba5f58da6603e1e6e

mattiat commented 13 years ago

New psm grouping does not solve Genn's problem: https://github.com/percolator/percolator/issues#issue/13.

percolator commented 13 years ago

Mattia,

Do not build the path /scratch/ into any of your code. It's a path that is not supported by any of the major platforms. It's local to cbr and maybe pdc. Use /tmp instead.

Cheers

-Lukas

Lukas Käll http://kaell.org
Center for Biomembrane Research
Dep. of Biochemistry and Biophysics
Stockholms Universitet
SE-10691 Stockholm, Sweden
Tel: +46 8 162947
Fax: +46 8 153679
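As a side note on the temporary-path advice above, one common way to avoid hard-coding any directory is to ask the platform for its temporary directory. The sketch below uses Python's standard `tempfile` module purely as an illustration; the `scratch_file_path` helper and the file name are hypothetical, and the project's own C++ code would need its own equivalent.

```python
import os
import tempfile


def scratch_file_path(name):
    """Build a path inside the platform's temporary directory.

    tempfile.gettempdir() typically resolves to /tmp on Unix-like systems
    unless TMPDIR (or TEMP/TMP) points elsewhere.
    """
    return os.path.join(tempfile.gettempdir(), name)


path = scratch_file_path("sqt2pin_example.tmp")
```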

On Mon, May 23, 2011 at 13:44, mattiat reply@reply.github.com wrote:

New psm grouping does not solve Genn's problem: https://github.com/percolator/percolator/issues#issue/13.
