mwalzer / psi-pi

Automatically exported from code.google.com/p/psi-pi

Representing TPP probabilities #78

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Eric posted this message to the list. You need to read the attached PDF, which 
contains the images, for it to make sense.

***************************

2) Our next challenge was how to encode the TPP probabilities. By way of 
background:
PeptideProphet derives a “PSM probability” for each PSM.
iProphet derives a “peptide probability” for each distinct peptide sequence 
(and also refines the individual PSM probabilities)
ProteinProphet derives a “protein probability” for each protein (and also 
reports peptide ion probabilities)
Further, it appears that the <Peptide> element in mzIdentML is really a concept 
like a ModifiedPeptide (charge is coalesced at this level, but two different 
modification states appear to be different <Peptide>s), right?

So, I believe there are potentially 6 different concepts that one could report:

a) “PSM probability” (probability that an individual peptide-spectrum match 
is correct)
b) “peptide ion probability” (probability that a peptide ion [distinguished 
by different charge numbers and mass mods] is correctly identified as present)
c) “modified peptide probability” (probability that a modified peptide 
[coalescing possibly multiple charge numbers] is correctly identified as 
present)
d) “distinct peptide probability” (probability that a distinct peptide 
sequence [coalesced over charge and mods] is correctly identified as present)
e) “protein probability” (probability that a specific protein sequence 
[potentially multiple identifiers] is correctly identified as present)
f) “protein group probability” (probability that at least one member of an 
AmbiguityGroup is correctly identified as present)

(and potentially a few more, if one wants to split hairs even further than 
this).
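
One way to picture how levels a) to d) nest is as successively coarser grouping keys over the same list of PSMs. The sketch below is only an illustration under assumptions: the `PSM` record, its field names, and the modification string format are hypothetical and not taken from mzIdentML or any TPP output. It only counts groups; it does not attempt to reproduce how PeptideProphet, iProphet, or ProteinProphet derive a probability for each group. Levels e) and f) need protein inference and are left out.

```python
from collections import defaultdict
from typing import NamedTuple


class PSM(NamedTuple):
    """Hypothetical PSM record; field names are illustrative only."""
    spectrum_id: str
    sequence: str
    mods: str          # made-up modification string, "" when unmodified
    charge: int
    probability: float


# Each level a)-d) is defined by what it distinguishes (its grouping key).
LEVEL_KEYS = {
    "a) PSM": lambda m: (m.spectrum_id, m.sequence, m.mods, m.charge),
    "b) peptide ion": lambda m: (m.sequence, m.mods, m.charge),
    "c) modified peptide": lambda m: (m.sequence, m.mods),
    "d) distinct peptide": lambda m: (m.sequence,),
}


def group_by_level(psms):
    """Return {level name: {key: [PSMs sharing that key]}}."""
    grouped = {}
    for level, key in LEVEL_KEYS.items():
        buckets = defaultdict(list)
        for m in psms:
            buckets[key(m)].append(m)
        grouped[level] = dict(buckets)
    return grouped


psms = [
    PSM("scan1", "SAMPLEK", "", 2, 0.98),
    PSM("scan2", "SAMPLEK", "", 3, 0.91),
    PSM("scan3", "SAMPLEK", "3:Oxidation", 2, 0.85),
    PSM("scan4", "SAMPLEK", "3:Oxidation", 2, 0.40),
    PSM("scan5", "TESTPEPTIDEK", "", 2, 0.99),
]

grouped = group_by_level(psms)
for level, buckets in grouped.items():
    print(f"{level}: {len(buckets)} entries")
```

With this toy input, five PSMs collapse to four peptide ions, three modified peptides, and two distinct peptides, which is why a single "peptide" confidence term is ambiguous unless the level is stated.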

According to OLS, this is what the CV has for *probabil*:

Essentially, none of the concepts I have advanced.

BUT, there is the concept of FDR. Several in fact. OLS says about *fdr*:

Some might say that a “local FDR” is really just 1 - probability. Can any 
licensed statisticians out there comfortably declare that “local FDR == 1 - 
probability”, or is that false?

I suspect that experts will say that they are somewhat different concepts, that 
a probability applies to a single entry, while a local FDR applies to a 
location in a list and can depend in some ill-defined way on neighbors in the 
list. So I would propose that we should introduce the concepts of probability. 
And I propose different probabilities for each of the concepts listed above 
a-f. Few if any individual software packages will use all concepts, but surely 
the ensemble of all packages will use all of these concepts.

Thoughts? Do we add 6 new generic terms? What do we do about the relationship 
between local FDR and probability?

Confounding this decision, I now notice that the situation is a little bit more 
complicated. From OLS:

It appears that we have several different terms of confidence like e-value, 
local FDR, p-value. And they live under either “peptide identification 
confidence metric” or “protein identification confidence metric”.

What is meant by peptide in this case? Do we mean “distinct peptide” or do 
we mean “PSM”? The definition does not make it clear.

Maybe “local FDR” should have 6 parents, one for each of the above concepts 
a-f? But then we have the problem that the context probably cannot distinguish 
which is meant. mzIdentML doesn’t have all these concepts, so we would 
potentially have to put several probabilities under the same mzIdentML element.

So... we’re left with the situation that, in mzIdentML, the TPP/PeptideShaker 
probabilities are being written out either as userParams or as 
incorrect/ambiguous cvParams. What should we do? Should we try to fix this, or just pick what seems 
to be the least wrong choice? Or maybe someone can show that we have correct 
terms that I failed to find?

Original issue reported on code.google.com by andrewro...@googlemail.com on 22 Apr 2013 at 10:04

GoogleCodeExporter commented 9 years ago
Agree that this is a bit of a mess - terms have been added on an ad hoc basis 
as we have needed them. We really need someone with a reasonable statistical 
grounding to go through and write sensible terms and definitions.

For anyone who wants some of the background, some of the concepts are described 
and compared here:
Lukas Käll, John D. Storey, Michael J. MacCoss and William Stafford Noble
Posterior error probabilities and false discovery rates: two sides of the same 
coin
Journal of Proteome Research, 7(1):40-44, January 2008
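
As a rough numeric illustration of the distinction that paper draws: assuming each reported probability is a posterior probability that the entry is correct, the posterior error probability (PEP, which, as far as I can tell, the paper treats as the same thing as a local FDR) of an entry is 1 - probability, while the FDR of a thresholded list is the average PEP of everything accepted. The Python sketch below uses made-up probabilities and hypothetical function names; it is not how any TPP tool computes its values.

```python
from typing import List, Sequence


def posterior_error_probabilities(probabilities: Sequence[float]) -> List[float]:
    """Per-entry PEP, taken here as 1 - posterior probability (an assumption)."""
    return [1.0 - p for p in probabilities]


def expected_fdr_at_each_rank(probabilities: Sequence[float]) -> List[float]:
    """Expected FDR of the top-k entries, for k = 1..n.

    Entries are ranked by decreasing probability; the expected FDR of the
    top-k set is the mean PEP of those k entries.
    """
    ranked = sorted(probabilities, reverse=True)
    running_error = 0.0
    fdrs = []
    for k, p in enumerate(ranked, start=1):
        running_error += 1.0 - p      # add this entry's PEP
        fdrs.append(running_error / k)
    return fdrs


# Made-up PSM probabilities, for illustration only.
psm_probabilities = [0.99, 0.95, 0.90, 0.60, 0.20]
ranked = sorted(psm_probabilities, reverse=True)
peps = posterior_error_probabilities(ranked)
fdrs = expected_fdr_at_each_rank(psm_probabilities)

for p, pep, fdr in zip(ranked, peps, fdrs):
    print(f"probability={p:.2f}  PEP={pep:.2f}  FDR of list down to here={fdr:.3f}")
```

The point of the sketch is that the per-entry quantity and the list-level quantity differ: an entry with probability 0.60 has a PEP of 0.40, yet the list that ends at that entry has an expected FDR of only 0.14, because the better entries above it dilute the error.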

Original comment by andrewro...@googlemail.com on 22 Apr 2013 at 10:09

GoogleCodeExporter commented 9 years ago
Further to this - if we can't get anyone to do the full job of tidying up the CV 
terms for statistical concepts, then at a minimum adding the terms Eric 
suggested seems a good idea in the short term.

There may be some other concepts beyond a-f that we could think up, e.g. 
modification site probability, false localization rate, etc.

Original comment by andrewro...@googlemail.com on 22 Apr 2013 at 10:18