vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

sample order affects results despite "unrelated runs" option #914

Closed chscho closed 6 months ago

chscho commented 7 months ago

Hi

We've re-run a search of the same three RAW files in different orders (A>B>C vs. C>B>A) to test the influence of the "unrelated runs" option. Comparing the two search results showed reduced scattering relative to the same approach with the "unrelated runs" option deactivated. Nevertheless, even with the "unrelated runs" option activated, some proteins still differ quite strongly between the two searches, which is quite puzzling to us. Can you explain why this happens, and how the influence of the sample order can be reduced to increase reproducibility?

[Screenshot 2024-01-24 at 15 29 00]

Thank you in advance for your help and explanation.

Best, Christian

report_ABC.log.txt report_CBA.log.txt

vdemichev commented 7 months ago

Hi Christian,

Yes, the sample order does affect the results here; this is by design.

For reproducible incremental processing (if this is the goal), please see https://github.com/vdemichev/DiaNN?tab=readme-ov-file#frequently-asked-questions

Best, Vadim

chscho commented 7 months ago

Dear Vadim

Thank you for your swift reply. If this effect is "by design", can you elaborate a bit on how exactly the sample order affects the quantification? We don't think that incremental processing settings are what we need (we are talking about experiments with fewer than 20 samples), but we were stunned that processing files in the order A>B>C vs. C>B>A can result in a 5 % difference in ID numbers and some very strong deviations in quantification (we actually also see a difference, albeit a smaller one, when comparing C>B>A vs. C>A>B). Are there any recommendations on how to optimally define the order of the files (other than alphanumerically), or do you recommend a specific post-search filtering to mitigate the effect of the sample order?

Best, Christian

chenliangyu18 commented 6 months ago

We have faced the same confusion. In large queues, even a slight change in run order influences the results.

chscho commented 6 months ago

Dear Vadim

It looks like we have to accept a certain change in quantification based on the order of samples.

Nevertheless, we’ve meanwhile also checked the influence of a minor change to the search DB on protein quantification and hope that you can explain what is happening here:

So, basically, we just added a single protein (GFP) to the SwissProt human proteome + E. coli proteome (~25'000 entries) and used identical RAW files to perform the "directDIA" analysis once with and once without the GFP in the DB.

[Screenshot 2024-03-07 at 17 09 47: Venn diagram]

Can you please comment on the observed strong scattering of PG.Quantity and the only 92% overlap between the two searches? How is it possible that a 0.004% larger search space has such a tremendous influence on the quantification of individual proteins? Is there a way to reduce this effect? (We had already selected the "unrelated runs" option for this analysis.)

We’ve further looked at the correlation in relative quantification by plotting the fold changes of two sample groups with an E. coli spike-in at a 1:2 ratio on a human background (after processing with MSstats):

[Figure: FCcomparison]

Here, too, we observed quite some scattering around the expected 2-fold change for the E. coli proteins (1,1) and the human proteins (0,0). I personally have a hard time accepting that the addition of a single unrelated protein to the search space can completely change the apparent regulation of some (E. coli) proteins.

Thank you in advance for looking into this once more and for your explanations. If you need any additional information, I’m happy to help.

Best, Christian

vdemichev commented 6 months ago

Hi Christian,

This is actually expected; the difference observed is a random fluctuation. If I were to change the seed of the random number generator used by DIA-NN, it would have exactly the same effect.

Peptides are searched in batches. A single extra protein completely changes how peptides are (randomly) distributed across batches. This in turn changes which peptides are used for (i) mass calibration and, more importantly, the determination of MS2 mass accuracy (a potentially major effect on quantification), and (ii) RT alignment. Effect (i) changes the correlation scores and measured signal intensities of many fragments, which means that for many peptides different fragments are selected for quantification; this easily leads to fold-changes in absolute quantities (note: not in relative quantities, but in absolute quantities, which are not supposed to be meaningful in label-free proteomics anyway). In addition, there will be peptides identified in one case but not the other, which can significantly affect the absolute quantification of proteins, with much less noticeable effects on relative quantification. On a scatterplot the relative quant correlation might even appear lower, but that is an effect of the much greater relative impact of noise.
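The batching effect can be sketched as a toy model (this is not DIA-NN's actual code: the batch size and the way one extra entry shifts the partition are purely illustrative):

```python
# Toy model of batch-wise searching: peptides are split into fixed-size
# batches, and early batches drive calibration. One extra protein shifts
# every downstream peptide into a different batch, so a different peptide
# subset ends up determining mass calibration and RT alignment.

BATCH_SIZE = 3  # illustrative; the real batch size is much larger

def make_batches(peptides, batch_size=BATCH_SIZE):
    """Split the peptide list into consecutive fixed-size batches."""
    return [peptides[i:i + batch_size]
            for i in range(0, len(peptides), batch_size)]

library = [f"PEP{i:02d}" for i in range(9)]  # digest of the original DB
library_with_gfp = ["GFP_PEP"] + library     # one extra protein's peptide

print(make_batches(library)[0])           # → ['PEP00', 'PEP01', 'PEP02']
print(make_batches(library_with_gfp)[0])  # → ['GFP_PEP', 'PEP00', 'PEP01']
```

Even in this deterministic toy version, the first (calibration-driving) batch changes the moment one entry is added; with random assignment, the reshuffling affects every batch.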

So yes, one extra protein searched means a noticeable change. If you fix the mass accuracies & scan window, you will get quantities that are more similar, but still not identical. You can try --no-batch-mode: this will almost eliminate the peptide-level quantities variation, and the protein-level variation will be reduced to that caused by differential identification of peptides. But this will be a lot slower; the speed optimisations mean that one extra protein can indeed change things a lot. Another option is to use --ref to supply a separate library for calibration; I think it will only work without MBR and should reduce variation too, but this is currently labelled as 'experimental' and not well tested.
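As a concrete illustration, the two workarounds could be invoked roughly as follows (a sketch only: the file names and command layout are placeholders; only --no-batch-mode and --ref come from the discussion above, so check the DIA-NN documentation for your version):

```shell
# Sketch only: all paths and library names are placeholders.

# Option 1: disable batch mode (much slower, near-identical peptide-level quantities)
diann --f run_A.raw --f run_B.raw --f run_C.raw \
      --lib library.tsv --out report.tsv \
      --no-batch-mode

# Option 2: supply a fixed calibration library (experimental; likely requires MBR off)
diann --f run_A.raw --f run_B.raw --f run_C.raw \
      --lib library.tsv --ref calibration_lib.tsv \
      --out report.tsv
```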

Is such observed variation a problem? No, because these differences are likely a lot smaller than what you will see between repeat injections of the same sample, even on the most robust LC system. It is also important to keep in mind that everything the software outputs is, in the case of IDs, based on statistically controlled confidence (not absolute confidence) and, in the case of quantities, meant solely for relative (not absolute) quantification. If you are using QuantUMS in the latest betas, then the quantities are also statistically justified when it comes to the control of LC-MS errors.

> I personally have a hard time to accept, that the addition of a single unrelated protein to the search space can completely change the regulation of some (E.coli) proteins.

Some considerations: 1% FDR means, in the worst case, that 1% of reported quantities have nothing to do with the real analyte levels. In practice, the number of unreliable quantities is higher, as the ability of the software to identify things currently far exceeds its capability for accurate quantification; we address this in QuantUMS, but here it's DIA-NN 1.8.1, which does not have quantification quality control. So a number of outliers on the scatterplot comparable to, or less than, 1% of the total IDs is very much expected for any minor change in the processing settings.
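A quick back-of-envelope check of this argument (the report size used here is an assumed, illustrative number, not taken from the logs above):

```python
# With ~8,000 protein groups reported at a 1% FDR (8,000 is an assumed,
# illustrative report size), up to ~1% of the reported quantities may be
# unreliable in the worst case, so dozens of scatterplot outliers after a
# minor settings change are within statistical expectation.

n_reported = 8000   # assumed number of reported protein groups
fdr = 0.01          # 1% FDR threshold

worst_case_unreliable = n_reported * fdr
print(int(worst_case_unreliable))  # → 80
```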

Best, Vadim

chscho commented 6 months ago

Dear Vadim

Thank you very much for this great explanation of the inner workings of DIA-NN. Just out of curiosity: Is --no-batch-mode equivalent to setting --threads to 1? Or is the number of batches not solely dependent on the number of available threads?

Nevertheless, I will definitely also have a look at QuantUMS soon...

Best, Christian

vdemichev commented 6 months ago

--no-batch-mode makes DIA-NN search all library precursors when performing calibration. This is very slow, but the calibration will be somewhat better (a negligible difference in almost all cases). No, batches have a fixed size, and their number does not depend on the number of threads. It would not be good to have these connected: the search results must not depend on the number of threads.
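In other words, the batch count is a function of the library size alone. A sketch under assumed values (the actual batch size in DIA-NN is not stated here):

```python
import math

BATCH_SIZE = 5000  # fixed batch size; the concrete value is an assumption

def n_batches(n_precursors, batch_size=BATCH_SIZE):
    """Number of search batches: depends only on the library size."""
    return math.ceil(n_precursors / batch_size)

# The thread count never enters the formula: threads just consume batches
# from a work queue, so the partition is identical for 1, 8, or 32 threads.
for threads in (1, 8, 32):
    print(threads, n_batches(123456))  # → 25 batches every time
```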

chscho commented 6 months ago

Ah, that makes a lot of sense. Thank you for your prompt reply!