lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
124 stars 50 forks source link

Simulating missing values #476

Closed TomSmithCGAT closed 5 years ago

TomSmithCGAT commented 5 years ago

To test suitable imputation approaches for missing quantification values in a given proteomics experiment, it's usually beneficial to start by simulating missing values. When considering MCAR, simulating missing values is relatively straightforward. However, missing values in proteomics are typically MAR, and whilst we might have some idea what causes the missing values, it's not straighforward to simulate the generative process.

I propose a data-driven solution for isobaric tagging quantification which takes advantage of the fact that we frequently have multiple PSMs for the same peptide. Missing values can therefore be reasonably simulated by taking pairs of PSMs from the same peptide where one contains missing values and the other doesn't (PSMm & PSMnm respectively). We can then add missing values in PSMnm at the same position as those in PSMm and scale the intensities in PSMnm to PSMm in order to create a PSM which is broadly similar to the PSMm but with a ground truth for the missing values. The overall structure of the simulated and real missing values in the dataset is therefore similar without needing to model the generative process for the missing values in a given dataset. Missing values could theoretically also be simulated at the peptide level using the same approach on the assumption that peptides from the same protein should have the same relative abundances in each sample, however this assumption will not hold true in all cases so this extension may require additional checks for divergent peptides.

For LFQ, we only obtain quantification values for each sample at the peptide or protein level, depending on how the PSM level quantification values are aggregated. It would be possible to use the above approach to simulate missing values in peptide-level LFQ quantification but this is dependent on the same assumption as above and I haven't given much thought to whether this would be useful given I expect missing values are typically only dealt with at the protein level with LFQ?

I think it would be helpful to have a function in MSnbase (or pRoloc?) to simulate missing values as proposed, possibly with an additional function to compare downstream imputed values with the ground truth. This would allow users to easily test different imputation approaches. My concern is that this function may only be suitable for isobaric tagging but I guess this can be made clear in the documentation?

The function below is my first attempt at an implementation (suitable input data here). If you'd be happy for this simulation approach to be included in MSnbase, I'll add some checks for edge cases etc, document and issue a proper PR.

addMissing <- function(obj, id_column, max_missing=Inf, verbose=TRUE){

  # identify ids with missing values
  missing <- obj[rowSums(is.na(obj))>0,]  
  id_missing <- fData(missing)[[id_column]]

  # identify ids without missing values
  not_missing <- obj[rowSums(is.na(obj))==0,]
  id_not_missing <- fData(not_missing)[[id_column]]

  # identify ids with both missing and no missing
  id_both <- sample(intersect(id_missing, id_not_missing))

  # how many features without missing values for these ids
  n_features <- nrow(not_missing[fData(not_missing)[[id_column]] %in% id_both,])

  if(verbose){
    message(
      sprintf("%s features where missing values can be simulated. These comprise %s unique meta-features\n",
              n_features, length(id_both)))
  }

  # create new array matrices to hold the truth and missing quantification values
  truth_e <- missing_e <- exprs(obj)

  missing_n <- 0

  # nested for loop. Probably more efficient way to perform this but
  # run time is ~6s for 80K PSM dataset
  for(id in id_both){

    if(missing_n == max_missing) break

    # get exprs for id and features +/- missing values
    .e <- exprs(obj)[fData(obj)[[id_column]]==id,]
    .e_m <- .e[rowSums(is.na(.e))>0,,drop=FALSE]
    .e_nm <- .e[rowSums(is.na(.e))==0,,drop=FALSE]

    feature_nm <- rownames(.e_nm)
    feature_m <- rownames(.e_m)

    # obtain a random set of pairs of +/- missing, without replacement of either
    rand_m <- sample(feature_m, 
                     size=min(length(feature_nm), length(feature_m)),
                     replace=FALSE)
    feature_nm <- sample(feature_nm)[1:length(rand_m)]

    for(ix in seq_along(feature_nm)){

      if(missing_n == max_missing) break

      # get pair of features +/- missing values
      id_nm <- feature_nm[ix]
      id_m <- rand_m[ix]

      # get intensity ratio for feature pair
      ave_e_m <- sum(.e_m[id_m,], na.rm=TRUE)
      ave_e_nm <- sum(.e_nm[id_nm,])
      intensity_ratio <- ave_e_nm/ave_e_m

      # scale down the -m feature and add missing values
      reduced_values <- .e_nm[id_nm,]/intensity_ratio
      reduced_values_add_m <- reduced_values
      reduced_values_add_m[is.na(.e_m[id_m,])] <- NA

      # replace feature quantification values in array matrices
      truth_e[id_nm, ] <- reduced_values
      missing_e[id_nm, ] <- reduced_values_add_m

      missing_n <- missing_n + 1
    }
  }

  truth <- missing <- obj
  exprs(truth) <- truth_e
  exprs(missing) <- missing_e

  # identify positions of missing values
  sim_missing <- rowSums(is.na(truth_e))==0 & rowSums(is.na(missing_e))>0
  fData(missing)$sim_missing <- sim_missing
  fData(truth)$sim_missing <- sim_missing

  if(verbose){
    message(sprintf("%s/%s features have had missing values simulated",
                    sum(fData(missing)$sim_missing),
                    nrow(missing)))
  }

  invisible(list("truth"=truth, "missing"=missing))
}

psm_level <- readRDS("psm_level_lopit_dc_quant.rds")
add_missing_psm <- addMissing(psm_level, "id", max_missing=100, verbose=TRUE)
TomSmithCGAT commented 5 years ago

Just an update to say that for now I'm going to keep this functionality in a separate package for benchmarking spatial proteomics workflows. Very early stages for the package: https://github.com/TomSmithCGAT/OptProc. Unless you think this would be good to include in MSnbase, this issue can be closed.