vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Quantity.Quality #1102

Open NicoDrou opened 1 month ago

NicoDrou commented 1 month ago

Dear Vadim,

I have some questions about Quantity.Quality. In a previous discussion (https://github.com/vdemichev/DiaNN/discussions/764), you explained that this parameter can be used to filter the precursors and keep only the high-quality ones, and you even suggested 0.8 as a basic threshold. However, in a more recent post, you mention it only makes sense when using QuantUMS (https://github.com/vdemichev/DiaNN/issues/1091).

Therefore, could you confirm whether we should use Quantity.Quality to filter our precursors before running the MaxLFQ algorithm? What about using an RSD filter based on QC samples? As an example, I have peptides with an RSD of 20% across 60 technical replicates spread over 10 batches, but with a Quantity.Quality of 0! Which would you trust more to keep the best precursors for quantification? By the way, are you planning to add a QuantUMS function to your DIA-NN R-package? :D
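For reference, the QC-based RSD filter described above can be sketched as follows; this is a minimal illustration with made-up quantities, and the precursor names and the 20% cutoff are just this example's assumptions (in practice the values would come from a DIA-NN report):

```python
from statistics import mean, stdev

def rsd_percent(values):
    """Relative standard deviation (CV) in percent across QC replicates."""
    return stdev(values) / mean(values) * 100.0

# Hypothetical quantities for two precursors across technical replicates:
qc_quantities = {
    "PEPTIDEK2": [1.00e6, 1.05e6, 0.95e6, 1.02e6],   # stable precursor
    "NOISYPEPK2": [1.0e6, 2.1e6, 0.4e6, 1.6e6],      # highly variable
}

# Keep only precursors whose QC RSD is at or below the chosen cutoff:
kept = {p for p, v in qc_quantities.items() if rsd_percent(v) <= 20.0}
print(kept)
```

With batched QC samples, the same computation would be done per batch and the per-batch RSDs compared, as in the 10-batch example above.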

Also, in the README you give this definition of Quantity.Quality: "Quantity.Quality when using QuantUMS is equal to 1.0 / (1.0 + SD), where SD is the standard deviation of the LC-MS-derived error in relative precursor quantification". But shouldn't the Quantity.Quality of a precursor then always be the same across different analyses? In my case, I get close but different values in each analysis.
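Taking the README definition at face value, the mapping between Quantity.Quality and the implied error SD can be written down directly (a minimal illustration of the published formula, not DIA-NN code):

```python
def quantity_quality(sd: float) -> float:
    """Quantity.Quality per the README: 1 / (1 + SD), where SD is the
    standard deviation of the LC-MS-derived error in relative
    precursor quantification."""
    return 1.0 / (1.0 + sd)

def sd_from_quality(q: float) -> float:
    """Invert the formula: the error SD implied by a given quality score."""
    return 1.0 / q - 1.0

# A Quantity.Quality threshold of 0.8 corresponds to an implied
# quantification-error SD of 0.25:
print(quantity_quality(0.25))  # 0.8
print(round(sd_from_quality(0.8), 6))  # 0.25
```

Note that since the SD is estimated from each run's data, run-to-run variation in the estimate would explain close-but-different scores per analysis.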

Many thanks in advance for your help.

Best regards, Nicolas

vdemichev commented 1 month ago

Hi Nicolas,

Best, Vadim

NicoDrou commented 1 month ago

Dear Vadim,

Thank you. So I guess the best is to combine both. Can 0.8 still be used as a basic threshold? In my case it removes ~10% of the total precursors.

vdemichev commented 1 month ago

So I guess the best is to combine both.

I would expect Quantity.Quality with QuantUMS, when averaged across runs, to be a better metric than CV. Run-specific Quantity.Quality is also useful.

Same applies to PG.MaxLFQ.Quality.
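A run-averaged Quantity.Quality filter, as suggested above, might be sketched like this; the scores and the 0.8 cutoff are purely illustrative (in a real analysis the per-run values come from the report's Quantity.Quality column):

```python
from statistics import mean

# Hypothetical per-run Quantity.Quality scores for two precursors:
qq_per_run = {
    "GOODPEPK2": [0.91, 0.88, 0.93],   # consistently high quality
    "POORPEPK2": [0.45, 0.0, 0.30],    # low quality in every run
}

threshold = 0.8  # example cutoff only; tune on your own data

# Filter on the average across runs rather than on individual runs,
# which avoids discarding a precursor for one bad run:
kept = {p for p, runs in qq_per_run.items() if mean(runs) >= threshold}
print(kept)
```

Filtering on the run average rather than per run keeps a precursor that dips in a single run, which limits the missing values a per-run filter would introduce.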

NicoDrou commented 1 month ago

Alright. So I guess it is better to use a higher threshold for averaged Quantity.Quality than for run-specific Quantity.Quality, in order to keep the best precursors without introducing too many missing values. Am I right? What thresholds would you then recommend, based on the hundreds or thousands of tests you have probably already run? :)

Regarding PG.MaxLFQ.Quality, shouldn't we only use it with QuantUMS, with no prerequisite filtering on the precursors? In theory, if we already keep only the high-quality precursors based on Quantity.Quality, only high-confidence protein groups should remain, right?

vdemichev commented 1 month ago

higher threshold for averaged Quantity.Quality than for run-specific Quantity.Quality

Yes, definitely!

What thresholds would you then recommend based on the hundred, thousands tests you probably already run :)

We did not, actually; it is difficult to get a good readout in 'real' experiments. I would suggest to (i) use PG.MaxLFQ.Quality, (ii) filter in a way that does not discard too many proteins, and (iii) gradually increase the thresholds to see if you can actually increase the number of DE proteins.
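The stepwise threshold increase in (iii) amounts to a simple sweep; a sketch with invented per-protein scores (the quality values and thresholds are placeholders, and the differential-expression step is only indicated as a comment):

```python
# Hypothetical per-protein PG.MaxLFQ.Quality values:
pg_quality = {"P1": 0.95, "P2": 0.85, "P3": 0.70, "P4": 0.55, "P5": 0.30}

for threshold in (0.0, 0.5, 0.7, 0.9):
    retained = [p for p, q in pg_quality.items() if q >= threshold]
    # In a real analysis, re-run the differential-expression test on
    # `retained` here and record how many proteins come out significant;
    # stop raising the threshold once the DE count stops improving.
    print(f"threshold {threshold:.1f}: {len(retained)} proteins retained")
```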

shouldn't we only use it with QuantUMS with no prerequisite filtering on the precursors?

Yes.

In theory, if we already keep only the high quality precursors based on the Quantity.Quality, only high confident PG should remain, right?

If you filter precursors stringently, protein quantities will typically get worse. Otherwise DIA-NN would just do this automatically.

NicoDrou commented 1 month ago

Dear Vadim, I am now puzzled. In theory, the more stringent the filtering of the precursors, the better the quantification should be, as we discard the precursors that bring noise to the quantification.

Maybe to give you a bit more context: I am working on longitudinal data from healthy donors. So in addition to the basal level of the proteins, we are also interested in the protein correlations at the donor level. And because we don't have the traditional case vs control groups, we need to be very careful when filtering the data. I am afraid that keeping such precursors will generate noise in the data and hide protein correlations.

For instance, I have a very complex situation with one protein. It is detected with 4 precursors: 2 precursors have Quantity.Quality > 0.8 and RSD < 20%; one has a Quantity.Quality of 0 but an RSD of 23%; and the 4th one has a very good Quantity.Quality score in my data set (>0.93), but in terms of quantification it does not perform well, with an RSD between 60 and 200% in my different batches (150% on average). When I only keep the top 2 peptides, based on RSD and Quantity.Quality filtering, I observe very interesting correlations with some proteins. But when I also include the Quantity.Quality 0.93 precursor (i.e. by filtering on Quantity.Quality only), the correlations are gone. And I observe that for many different proteins. So I am struggling to know whether the correlations biologically exist or whether I am seeing ghost correlations in my data set.

Do you have any thoughts on how you would approach the problem?

many thanks again for your help :)

vdemichev commented 1 month ago

I am now puzzled. In theory, the more stringent the filtering of the precursors, the better the quantification should be, as we discard the precursors that bring noise to the quantification.

1 good precursor and 4 noisy ones are often still better than just 1 good one. At least this is what we observe with controlled benchmarks.

I am afraid that keeping such precursors will generate noise in the data and hide protein correlations.

It all indeed depends on the purpose. Also note that it is a known phenomenon that different precursors of a protein might not correlate with each other, in particular because they primarily originate from different proteoforms. So measuring more is better in this sense: it gives a more complete picture. When you look at one or several specific proteins, it is good to have your script perform and visualise an analysis for each of their precursors, so that you can confirm that the different peptides of a protein indeed behave in a similar fashion across your conditions/samples.
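One way to script the per-precursor sanity check described above is to correlate each precursor's profile across samples against the others; a minimal Pearson-correlation sketch with made-up log-intensity profiles (precursor names and values are invented for illustration):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sample profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical log-intensity profiles of three precursors of one protein
# across five samples:
profiles = {
    "PEP_A": [10.0, 10.5, 11.0, 10.2, 10.8],
    "PEP_B": [12.1, 12.6, 13.0, 12.2, 12.9],  # tracks PEP_A well
    "PEP_C": [9.0, 14.0, 8.5, 13.5, 9.2],     # behaves differently
}

# Pairwise correlations flag precursors whose behaviour diverges,
# e.g. because they originate from a different proteoform:
names = list(profiles)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = pearson(profiles[names[i]], profiles[names[j]])
        print(f"{names[i]} vs {names[j]}: r = {r:+.2f}")
```

A low pairwise correlation for one precursor would be the cue to inspect it individually before deciding whether to keep it.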

NicoDrou commented 1 month ago

It all indeed depends on the purpose. Also note that it is a known phenomenon that different precursors of a protein might not correlate with each other, in particular because they primarily originate from different proteoforms. So measuring more is better in this sense: it gives a more complete picture. When you look at one or several specific proteins, it is good to have your script perform and visualise an analysis for each of their precursors, so that you can confirm that the different peptides of a protein indeed behave in a similar fashion across your conditions/samples.

You have a very good point here and I will test it!