percolator / percolator

Semi-supervised learning for peptide identification from shotgun proteomics datasets
http://percolator.ms

The number of PSMs from Percolator 3.0 #152

Closed wenbostar closed 8 years ago

wenbostar commented 8 years ago

Hi, when I ran Percolator 3.0 as shown below, I ran into a problem with the number of PSMs with q-value <= 0.01.

Percolator version 3.00, Build Date Jun 21 2016 16:19:29
Copyright (c) 2006-9 University of Washington. All rights reserved.
Written by Lukas Käll (lukall@u.washington.edu) in the Department of Genome Sciences at the University of Washington.
Issued command:
./percolator_linux.exe -U -m target.txt -M decoy.txt Feature.txt
Started Wed Jul 6 11:34:38 2016 on compute-53-10.local
Hyperparameters selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Reading input from datafile Feature.txt
Features:
......
Train/test set contains 565466 positives and 363759 negatives, size ratio=1.55451 and pi0=1
selecting cpos by cross validation
selecting cneg by cross validation
Selected feature number 4 as initial search direction, could separate 101751 positives in that direction
Selected feature number 4 as initial search direction, could separate 102439 positives in that direction
Selected feature number 4 as initial search direction, could separate 101556 positives in that direction
Estimating 152767 over q=0.01 in initial direction
Reading in data and feature calculation took 58.3 cpu seconds or 59 seconds wall time
---Training with Cpos selected by cross validation, Cneg selected by cross validation, fdr=0.01
Iteration 1 : After the iteration step, 167957 target PSMs with q<0.01 were estimated by cross validation
Iteration 2 : After the iteration step, 171439 target PSMs with q<0.01 were estimated by cross validation
Iteration 3 : After the iteration step, 172283 target PSMs with q<0.01 were estimated by cross validation
Iteration 4 : After the iteration step, 172513 target PSMs with q<0.01 were estimated by cross validation
Iteration 5 : After the iteration step, 172614 target PSMs with q<0.01 were estimated by cross validation
Iteration 6 : After the iteration step, 172620 target PSMs with q<0.01 were estimated by cross validation
Iteration 7 : After the iteration step, 172636 target PSMs with q<0.01 were estimated by cross validation
Iteration 8 : After the iteration step, 172635 target PSMs with q<0.01 were estimated by cross validation
Iteration 9 : After the iteration step, 172619 target PSMs with q<0.01 were estimated by cross validation
Iteration 10 : After the iteration step, 172624 target PSMs with q<0.01 were estimated by cross validation
........
After all training done, 172538 target PSMs with q<0.0100 were found when measuring on the test set
Found 172538 target PSMs scoring over 1.0000% FDR level on testset
Merging results from 3 datasets
Target Decoy Competition yielded 52608 target PSMs and 3227 decoy PSMs
Calibrating statistics - calculating q values
Merged list gives 49902 PSMs over q=0.0100
Calibrating statistics - calculating Posterior error probabilities (PEPs)
Processing took 2526 cpu seconds or 1151 seconds wall time

About 170,000 target PSMs have q<0.01 during the cross-validation stage, but only 49902 PSMs have q<=0.01 at the end of processing.

Do you know what the reason is? Best regards! Bo Wen

MatthewThe commented 8 years ago

I can't be sure from just the logs, but do you happen to have multiple PSMs with the same ScanNr, or with the same (ScanNr, expMass) combination if you also specify the experimental mass as an extra column?
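One way to check this is to count the rows per ScanNr in the feature file. The following is a minimal sketch, assuming a tab-delimited file with a header row containing a `ScanNr` column (as in the header quoted later in this thread). The function names `count_psms_per_key` and `crowded_keys` are hypothetical helpers, not part of Percolator, and the example rows are made up.

```python
import csv
import io
from collections import Counter


def count_psms_per_key(handle, key_columns=("ScanNr",)):
    """Count rows per key in a tab-delimited feature file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[tuple(row[col] for col in key_columns)] += 1
    return counts


def crowded_keys(counts, max_expected=2):
    """Return keys that have more rows than one target plus one decoy."""
    return {key: n for key, n in counts.items() if n > max_expected}


# Tiny made-up example; the values are illustrative only.
example = io.StringIO(
    "index\tlabel\tScanNr\tdeltaMZ\n"
    "a\t1\t100\t0.1\n"
    "b\t-1\t100\t0.2\n"
    "c\t1\t100\t0.3\n"
    "d\t1\t101\t0.1\n"
    "e\t-1\t101\t0.4\n"
)
crowded = crowded_keys(count_psms_per_key(example))
print(crowded)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, and use `key_columns=("ScanNr", "expMass")` if the file has an expMass column.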

Target-decoy competition reduces all hits from the same (ScanNr, expMass) combination to a single PSM. Typically this would choose between 1 target and 1 decoy PSM, but in your case you might have more than 2 PSMs per (ScanNr, expMass) combination.
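To illustrate the effect described above, here is a simplified sketch of target-decoy competition. It is not Percolator's actual implementation: it assumes each PSM is a dictionary with `ScanNr`, `label` (1 for target, -1 for decoy), `score`, and an optional `expMass`, and it simply keeps the highest-scoring PSM per (ScanNr, expMass) key. The function name and the example scores are made up.

```python
def target_decoy_competition(psms):
    """Keep only the best-scoring PSM for each (ScanNr, expMass) key."""
    best = {}
    for psm in psms:
        # expMass is optional; without it, the key is effectively ScanNr alone.
        key = (psm["ScanNr"], psm.get("expMass"))
        current = best.get(key)
        if current is None or psm["score"] > current["score"]:
            best[key] = psm
    return list(best.values())


# Scan 1 has three target PSMs and one decoy; scan 2 has one of each.
psms = [
    {"ScanNr": 1, "label": 1, "score": 2.5},
    {"ScanNr": 1, "label": 1, "score": 1.9},
    {"ScanNr": 1, "label": 1, "score": 1.2},
    {"ScanNr": 1, "label": -1, "score": 0.3},
    {"ScanNr": 2, "label": 1, "score": 0.4},
    {"ScanNr": 2, "label": -1, "score": 0.9},
]
survivors = target_decoy_competition(psms)
n_targets = sum(1 for p in survivors if p["label"] == 1)
n_decoys = sum(1 for p in survivors if p["label"] == -1)
print(len(psms), "PSMs in,", n_targets, "targets and", n_decoys, "decoys out")
```

Six PSMs go in and only two come out, one per scan. The log above shows a drop of the same kind: 565466 positives and 363759 negatives before competition, and 52608 target plus 3227 decoy PSMs after it.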

wenbostar commented 8 years ago

Hi Matthew The, thanks for your help. The header of my feature file looks like "index label ScanNr deltaMZ ...". It works well with Percolator 2.09. I wonder whether the feature file header has changed for Percolator 3.0, and whether the value of ScanNr can be a string. Best regards! Bo

MatthewThe commented 8 years ago

In Percolator 2.09, target-decoy competition was turned off by default. We flipped this in Percolator 2.10 (and Percolator 3.0), as it was brought to our attention that not doing target-decoy competition could result in inaccurate q-values. You can turn off target-decoy competition - at your own risk - by specifying the -y flag. We will actually make this the default behavior for PSM-only runs in the next release, as others have been confused by this as well.

Unfortunately, ScanNr cannot be a string. You will either have to encode some extra information in the ScanNr, e.g. by doing some modular arithmetic, or add an extra column called expMass, to which you can then give slightly different values for each target-decoy pair.
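The modular-arithmetic option could look roughly like the sketch below. It assumes the extra information can be expressed as a small non-negative integer (here called `extra_id`, for example an index assigned to each distinct string identifier) and that `stride` is larger than any original scan number. The helper names and the stride value are hypothetical, and whether the encoded values fit the integer range Percolator accepts for ScanNr should be checked separately.

```python
def encode_scan_nr(scan_nr, extra_id, stride):
    """Pack (extra_id, scan_nr) into one integer: extra_id * stride + scan_nr."""
    if not 0 <= scan_nr < stride:
        raise ValueError("scan_nr must lie in [0, stride)")
    if extra_id < 0:
        raise ValueError("extra_id must be non-negative")
    return extra_id * stride + scan_nr


def decode_scan_nr(encoded, stride):
    """Recover (scan_nr, extra_id) from an encoded ScanNr."""
    extra_id, scan_nr = divmod(encoded, stride)
    return scan_nr, extra_id


# Assumed upper bound on the original scan numbers; adjust to the data.
STRIDE = 1_000_000

encoded = encode_scan_nr(4711, 3, STRIDE)
print(encoded)
print(decode_scan_nr(encoded, STRIDE))
```

Because the original scan number is the remainder modulo `stride`, it can be recovered from the Percolator output afterwards, while PSMs with different `extra_id` values no longer share a ScanNr.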

wenbostar commented 8 years ago

Got it. Thanks a lot. Best regards! Bo