vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.
Other
259 stars 53 forks source link

Calculating Q.Value #876

Open markusk9 opened 9 months ago

markusk9 commented 9 months ago

Hi,

How is the Q.Value in the report.tsv calculated? I noticed across multiple experiments that the Q.Values have a strange distribution. For instance, I searched the "RD139_Narrow_UPS1_25fmol_inj1" file from MSV000087597 (Gotti et al) with 1.8.2 beta 27 using a predicted spectral library, and the Q.Value distribution is as follows:

image

Aren't q-values typically calculated by counting # of decoys / # of targets? Shouldn't the distribution appear exponential?

Searching the same file with MSFragger-DIA yields a distribution more in line with what I would expect: image

As a consequence, the distribution of PEP values also seems off (not cumulatively increasing)

markusk9 commented 9 months ago

Does DIA-NN use the standard definition of a q-value? "q-value is defined to be the minimum FDR at which a PSM, peptide, or protein will appear in the filtered output list" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4563723/)

vdemichev commented 9 months ago

Yes, the definition is standard. What DIA-NN does is indeed calculate the #decoys/#targets ratio, and then for some IDs that appear very confident it has special algorithm that allows assigning them lower q-values, below 1 / #targets which would be the minimum otherwise. This is why you see those q-values below 10^-4.

markusk9 commented 8 months ago

Can you explain what the "special algorithm" is, and why it's necessary? Why not just report the "true" q-values calculated (clearly they are being reporting for all precursors above some threshold)

This algorithm appears to break your statement that the definition is standard, since under the standard definition of a q-value, anywhere that q-values change is where a decoy was located. In other words, the number of unique q-values in a result file should approximate (or precisely equal, under the counting definition) the number of decoys identified at a given CScore threshold.

Under the above statement (true for any algorithm that uses the standard q-value definition), DIA-NN's FDR control appears to be broken - I'm not saying that it is, but I think it's important to clarify what exactly DIA-NN is doing to generate q-values, and why.

It's quite apparent that "true" q-value calculations are broken at some point, based on the CScore to q-value mapping being nearly identical across disparate sets of experiments: image

vdemichev commented 8 months ago

It's necessary because for global q-value calculation later on it's important to detect IDs with very high confidence. Also for situations when <100 precursors are identified at 1% FDR. The algorithm works using a formula that extrapolates the q-value based on CScore. It's 'standard' in the sense that if there were infinite number of decoys, then this is likely how q-values would look like. You can look up how mProphet (Spectronaut basis) q-value calculation works, this is even more 'non-standard'.

oliver-bernhardt commented 8 months ago

The FDR algorithm implemented in Spectronaut was developed by John D. Storey from the Department of Biostatistics at the University of Washington, Seattle. It was published in 2003 and has over 10'000 citations. It is a well recognized approach to FDR estimation that was originally introduced in DNA microarray data analysis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC170937/

We use this Qvalue/FDR definition in Spectronaut because it makes fewer assumption. Especially in regards to the search-space composition (decoy to false target ratios) and is therefore very universal.

However, I also have to side with Vadim here that extrapolating the CScore is fairly normal as nobody likes to have a pure 0 for a qValue (people tend to think that a 0 qvalue means something is wrong). In contrary, all qvalues before the first decoy observation would necessarily be 0 so there would be no further qualitative distinction for a precursors that scored higher.

So to answer your question (and Vadim correct me if I am wrong in the context of DIA-NN): > Why not just report the "true" q-values calculated (clearly they are being reporting for all precursors above some threshold) It's because they would all be 0 values and therefore not very helpful.

oliver-bernhardt commented 8 months ago

Furthermore, the Storey FDR strategy is also used by Skyline and OpenSWATH. All of the main peptide-centric search tools (except DIA-NN) I know of use the Storey method. So calling it non-standard is doing it a disservice.

fcyu commented 8 months ago

Hi @oliver-bernhardt

The FDR algorithm implemented in Spectronaut was developed by John D. Storey from the Department of Biostatistics at the University of Washington, Seattle. It was published in 2003 and has over 10'000 citations. It is a well recognized approach to FDR estimation that was originally introduced in DNA microarray data analysis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC170937/

If I understand it correctly, the PNAS paper you posted is about "converting" BH FDR to q-value. This method has been used not only in DIA, but also in many DDA tools. BUT. it is not about how to calculate p-value, which is the input of the BH procedure, from other test statistics or scores in proteomics community. That paper assumes that p-value is already available.

I think what this thread about is how to calculate the FDR from the CScore or whatever score which is not p-value. As you also know, Proteomics tools normally does not calculate any p-value, and the FDR it not from the BH procedure.

ps. I think proteomics people normally does not distinguish FDR and q-value in some literatures. Most people assume the FDR has been already “monotoned" which is actually q-value.

Best,

Fengchao

vdemichev commented 8 months ago

Fully agree Oliver, it's standard in the field, what I meant is not pure decoys/targets ratio.

oliver-bernhardt commented 8 months ago

@fcyu I have to disagree with you. I know for a fact that several different peptide centric DIA analysis tools use the CScore -> pValue -> qValue approach by Storey and as described in the mProphet paper (including Spectronaut, OpenSWATH, Skyline and I think even the original SWATH plugin for PeakView from SCIEX).

fcyu commented 8 months ago

Hi @oliver-bernhardt ,

Thanks for the correction! I guess I don't know DIA tools as good as you do.

Strictly speaking, it should be CScore -> p-value -> BH FDR -> Storey q-value, right?

Since the BH procedure and the Storey's method have no issue, the real question here is how to calculate p-value from CScore (for the tools you mentioned), or calculate the target-decoy-based FDR from the scores (for the tools don't calculate p-value). Are those DIA tools use the "standard" procedure for this step? What is the "standard", BTW?

Best,

Fengchao

markusk9 commented 7 months ago

So to answer your question (and Vadim correct me if I am wrong in the context of DIA-NN): > Why not just report the "true" q-values calculated (clearly they are being reporting for all precursors above some threshold) It's because they would all be 0 values and therefore not very helpful.

Hi Oliver, thanks for chiming in. I am very well accustomed with BH and how FDR is performed for DDA analysis. I believe the recommendation for many DDA tools is to do (1 + # decoys) / (# targets). This eliminates any (perceived) issue of having 0 q-values for the highest scoring targets, and is slightly more conservative.

I would like to also second Fengchao's question - it would be great to know how to calculate p-values from CScore. The root of my question is to gain more insight into how target-decoy FDR is working in DIA-NN (and other peptide-centric tools).