impute(msn, method = "nbavg") potential bug

Tlicknack commented 2 years ago

Hi-

This issue may be unique to my data-set, but I'm noticing that when impute() encounters a string of NAs before a true value, it rightfully imputes the lowest global value in the data-set in all but the rightmost, NA-containing column. In this column, it averages the low, imputed values with the true, rightmost value. This erroneously gives a strong signal in that 2nd to right column (especially when sum-normalized; NA becomes ~0.333).

Example of my raw data:

> exprs(msn[1])
# MAC.1 X300g.1 X1K.1 X3K.1 X5K.1 X9K.1 X12K.1 X15K.1 X30K.1 X79K.1 X120K.1    SUP.1
# NA      NA    NA    NA    NA    NA     NA     NA     NA     NA      NA 7.563419

Example of the imputed/normalized data:

> exprs(msn_imputed[1])
# MAC.1     X300g.1       X1K.1       X3K.1       X5K.1       X9K.1      X12K.1      X15K.1      X30K.1      X79K.1 X120K.1     SUP.1
# 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.003954851 0.3214688 0.6389827

Note that the X120K.1 column is the average of X79K.1 and SUP.1. This doesn't really make sense and creates an odd line across the PCA.

Thanks,

-Tim

lgatto commented 2 years ago

Could you also provide your sessionInfo() output, please.

Tlicknack commented 2 years ago

Hi Laurent,

Thanks for the quick reply! And apologies for not including that in the original inquiry (and apologies for not putting this in MSnbase).

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /packages/7x/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /packages/7x/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pRolocGUI_1.22.0     seqinr_4.2-5         ggplot2_3.3.5        dplyr_1.0.7          pRoloc_1.28.0        BiocParallel_1.24.1 
 [7] MLInterfaces_1.68.0  cluster_2.1.1        annotate_1.66.0      XML_3.99-0.8         AnnotationDbi_1.52.0 IRanges_2.24.1      
[13] MSnbase_2.14.2       ProtGenerics_1.20.0  S4Vectors_0.28.1     mzR_2.22.0           Rcpp_1.0.7           Biobase_2.50.0      
[19] BiocGenerics_0.36.0 

loaded via a namespace (and not attached):
  [1] BiocFileCache_1.14.0  plyr_1.8.6            splines_4.0.2         digest_0.6.27         foreach_1.5.1         htmltools_0.5.1.1    
  [7] viridis_0.6.0         fansi_1.0.2           magrittr_2.0.1        memoise_2.0.1         doParallel_1.0.16     mixtools_1.2.0       
 [13] limma_3.44.3          recipes_0.1.16        gower_0.2.2           askpass_1.1           lpSolve_5.6.15        prettyunits_1.1.1    
 [19] colorspace_2.0-2      blob_1.2.1            rappdirs_0.3.3        xfun_0.22             crayon_1.4.2          RCurl_1.98-1.3       
 [25] hexbin_1.28.2         impute_1.62.0         survival_3.2-10       iterators_1.0.13      glue_1.4.2            gtable_0.3.0         
 [31] ipred_0.9-11          zlibbioc_1.36.0       kernlab_0.9-29        scales_1.1.1          futile.options_1.0.1  vsn_3.56.0           
 [37] mvtnorm_1.1-1         DBI_1.1.2             viridisLite_0.4.0     xtable_1.8-4          progress_1.2.2        bit_4.0.4            
 [43] proxy_0.4-25          mclust_5.4.8          preprocessCore_1.50.0 DT_0.18               lava_1.6.9            prodlim_2019.11.13   
 [49] htmlwidgets_1.5.3     sampling_2.9          httr_1.4.2            FNN_1.1.3             RColorBrewer_1.1-2    ellipsis_0.3.2       
 [55] pkgconfig_2.0.3       nnet_7.3-15           dbplyr_2.1.1          utf8_1.2.2            caret_6.0-86          later_1.1.0.1        
 [61] tidyselect_1.1.0      rlang_0.4.11          reshape2_1.4.4        munsell_0.5.0         tools_4.0.2           LaplacesDemon_16.1.6 
 [67] cachem_1.0.6          cli_2.5.0             generics_0.1.0        RSQLite_2.2.8         ade4_1.7-16           stringr_1.4.0        
 [73] fastmap_1.1.0         mzID_1.26.0           ModelMetrics_1.2.2.2  knitr_1.34            bit64_4.0.5           purrr_0.3.4          
 [79] randomForest_4.6-14   dendextend_1.15.2     ncdf4_1.17.1          nlme_3.1-152          mime_0.11             formatR_1.9          
 [85] xml2_1.3.2            biomaRt_2.44.4        compiler_4.0.2        curl_4.3              e1071_1.7-6           affyio_1.58.0        
 [91] tibble_3.1.3          stringi_1.5.3         futile.logger_1.4.3   lattice_0.20-41       Matrix_1.3-2          vctrs_0.3.8          
 [97] pillar_1.6.4          lifecycle_1.0.1       BiocManager_1.30.12   MALDIquant_1.20       data.table_1.14.2     bitops_1.0-6         
[103] httpuv_1.5.5          R6_2.5.0              pcaMethods_1.80.0     affy_1.66.0           promises_1.2.0.1      gridExtra_2.3        
[109] codetools_0.2-18      lambda.r_1.2.4        MASS_7.3-53.1         gtools_3.8.2          assertthat_0.2.1      openssl_1.4.3        
[115] withr_2.4.3           hms_1.1.1             grid_4.0.2            rpart_4.1-15          timeDate_3043.102     coda_0.19-4          
[121] class_7.3-18          segmented_1.3-3       pROC_1.17.0.1         shiny_1.6.0           lubridate_1.7.10

lgatto commented 2 years ago

Your R version (and this all Bioconductor packages) are outdated. While this isn't the reason for the results your observe, I would recommend to update.

Your results are correct, I believe. Indeed, you'll find in the documentation (here I point to the latest manual page) that

Continuous sets ‘NA’ value at the beginning and the end of the quantitation vectors are set to the lowest observed value in the data ...

This explains the stretch of 0.003954851 that you see. The penultimate value is equal to the average of that minimal value and the one in the last fraction, as computed by the algorithm. Finally, the last value is different from what you had in the beginning due to the normalisation you used.

In conclusion, there's no sign of anything wrong in the data. However, you will need to question the validity of such a profile with all but one value is present. Unless this is indeed a clear case of a protein that resides in an organelle that is only present in the very last fraction, I would suggest to remove such cases.

PS: it's fine to ask this question here in pRoloc, as it's clearly about spatial proteomics data processing.

Tlicknack commented 2 years ago

Thanks for the note about updates.

And thanks for the explanation. I did read the documentation (which was very helpful and easy to navigate) but didn't realize that nbavg did both the global minimum imputation AND averaging on the same protein row. In our fractionation scheme, this will actually convert Cytosolic proteins (i.e. those only present in the Sup-- Supernatant of Diff Centrifugation) into Proteasome-Protein Complexes (which in our hands are abundant in that final--120K-- fraction and the Supernatant). Given this information, I'll just remove/edit these rows.

I appreciate your time! And thanks for making such incredibly useful and intuitive packages for this sort of analysis!

lgatto commented 2 years ago

Well, imputing by neighbour values need some data to work. If the data are missing at the end of the gradient, it is assumed that the protein was absent at these positions, and are imputed by the minimal value and then only, it can perform neighbour averaging.

Your case was rather extreme, with all but the last value missing, and this probably better to remove completely - see filterNA()

lgatto / pRoloc

impute(msn, method = "nbavg") potential bug #145