Closed TomSmithCGAT closed 5 years ago
Just an update to say that for now I'm going to keep this functionality in a separate package for benchmarking spatial proteomics workflows. Very early stages for the package: https://github.com/TomSmithCGAT/OptProc. Unless you think this would be good to include in MSnbase, this issue can be closed.
To evaluate imputation approaches for missing quantification values in a given proteomics experiment, it's usually beneficial to start by simulating missing values. For values missing completely at random (MCAR), this is relatively straightforward. However, missing values in proteomics are typically missing at random (MAR), and whilst we might have some idea what causes them, it's not straightforward to simulate the generative process.
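For comparison, the MCAR case can be sketched in a few lines. This is a minimal Python illustration rather than MSnbase code; the function name `simulate_mcar`, the list-of-lists matrix layout and the use of `None` for missing values are all assumptions made for the example.

```python
import random


def simulate_mcar(matrix, fraction, seed=0):
    """Blank out a random fraction of the observed cells (MCAR).

    matrix   : list of rows, each a list of intensities (None = already missing)
    fraction : proportion of currently observed cells to remove
    Returns (simulated_matrix, truth) where truth maps (row, col) to the
    value that was removed, so imputed values can be checked later.
    """
    rng = random.Random(seed)
    # Only cells that currently hold a value are candidates for removal.
    observed = [
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value is not None
    ]
    n_remove = round(fraction * len(observed))
    removed = rng.sample(observed, n_remove)

    simulated = [list(row) for row in matrix]  # copy, leave the input untouched
    truth = {}
    for i, j in removed:
        truth[(i, j)] = simulated[i][j]
        simulated[i][j] = None
    return simulated, truth
```

Every observed cell has the same chance of being removed, which is exactly why this does not reproduce the structure of real missing values in a proteomics dataset.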
I propose a data-driven solution for isobaric tagging quantification which takes advantage of the fact that we frequently have multiple PSMs for the same peptide. Missing values can be reasonably simulated by taking pairs of PSMs from the same peptide where one contains missing values and the other doesn't (PSMm and PSMnm respectively). We then add missing values to PSMnm at the same positions as those in PSMm and scale the intensities in PSMnm to PSMm. This creates a PSM which is broadly similar to PSMm but with a ground truth for the missing values. The overall structure of the simulated and real missing values in the dataset is therefore similar, without needing to model the generative process for the missing values in a given dataset. Missing values could theoretically also be simulated at the peptide level using the same approach, on the assumption that peptides from the same protein should have the same relative abundances in each sample. However, this assumption will not hold true in all cases, so this extension may require additional checks for divergent peptides.
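The pairing step described above can be sketched for a single PSMm/PSMnm pair as follows. This is a minimal Python illustration, not the R implementation referred to in this issue. The function name `simulate_from_pair` is hypothetical, and the scaling rule (matching the summed intensity over the channels observed in PSMm) is an assumption, since the text does not specify how the intensities are scaled.

```python
def simulate_from_pair(psm_m, psm_nm):
    """Transfer the missing-value pattern of psm_m onto psm_nm.

    psm_m  : intensities for the PSM with missing values (None = missing)
    psm_nm : intensities for a complete PSM from the same peptide
    Returns (simulated, truth): the simulated PSM with None at the positions
    missing in psm_m, and a dict {channel index: ground-truth intensity}.
    """
    if len(psm_m) != len(psm_nm):
        raise ValueError("both PSMs must cover the same channels")
    if any(v is None for v in psm_nm):
        raise ValueError("psm_nm must not contain missing values")

    observed = [k for k, v in enumerate(psm_m) if v is not None]
    missing = [k for k, v in enumerate(psm_m) if v is None]
    if not observed or not missing:
        raise ValueError("psm_m needs both observed and missing channels")

    # Assumed scaling rule: make the total intensity of psm_nm over the
    # channels observed in psm_m equal to the total intensity of psm_m.
    denominator = sum(psm_nm[k] for k in observed)
    if denominator <= 0:
        raise ValueError("cannot scale: psm_nm has no signal in shared channels")
    scale = sum(psm_m[k] for k in observed) / denominator

    scaled = [v * scale for v in psm_nm]
    truth = {k: scaled[k] for k in missing}
    simulated = [None if k in truth else v for k, v in enumerate(scaled)]
    return simulated, truth


# Example: channels 1 and 3 are missing in PSMm.
simulated, truth = simulate_from_pair(
    [100.0, None, 300.0, None],
    [20.0, 10.0, 60.0, 30.0],
)
print(simulated)  # [100.0, None, 300.0, None]
print(truth)      # {1: 50.0, 3: 150.0}
```

The simulated PSM keeps the relative channel profile of PSMnm, sits on the intensity scale of PSMm, and carries the known values that any imputation method should ideally recover.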
For LFQ, we only obtain quantification values for each sample at the peptide or protein level, depending on how the PSM-level quantification values are aggregated. It would be possible to use the above approach to simulate missing values in peptide-level LFQ quantification, but this depends on the same assumption as above (peptides from the same protein sharing the same relative abundances in each sample). I haven't given much thought to whether this would be useful, given that I expect missing values are typically only dealt with at the protein level with LFQ?
I think it would be helpful to have a function in MSnbase (or pRoloc?) to simulate missing values as proposed, possibly with an additional function to compare downstream imputed values with the ground truth. This would allow users to easily test different imputation approaches. My concern is that this function may only be suitable for isobaric tagging but I guess this can be made clear in the documentation?
The function below is my first attempt at an implementation (suitable input data here). If you'd be happy for this simulation approach to be included in MSnbase, I'll add some checks for edge cases etc, document and issue a proper PR.
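Separately from that implementation, the proposed comparison of imputed values with the ground truth could look roughly like the sketch below. It is a Python illustration only. The function name `imputation_rmse`, the `{(row, col): value}` layout for the ground truth and the choice of root-mean-square error as the metric are all assumptions, and other metrics may suit a given experiment better.

```python
import math


def imputation_rmse(imputed, truth):
    """Root-mean-square error of imputed values against simulated ground truth.

    imputed : matrix (list of rows) after imputation, with no missing values
              at the simulated positions
    truth   : dict {(row, col): true_value} recorded when the missing values
              were simulated
    """
    if not truth:
        raise ValueError("no simulated missing values to compare against")
    squared_errors = [
        (imputed[i][j] - true_value) ** 2
        for (i, j), true_value in truth.items()
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Running the same simulated dataset through several imputation methods and comparing the resulting scores gives a simple ranking of the methods for that dataset.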