ipb-halle / MetFamily

Understanding metabolism is fundamental in biomedical and plant research and the identification and quantification of thousands of metabolites by mass spectrometry in modern metabolomics is a prerequisite for elucidating this area. However, the identification of metabolites is a major bottleneck in traditional approaches hampering advances. Here, we present a novel approach for the untargeted discovery of metabolite families offering a bird's eye view of metabolic regulation in comparative metabolomics. We implemented the presented methodology in the easy-to-use web application MetFamily to enable the analysis of comprehensive metabolomics studies for all researchers worldwide. MetFamily is available under http://msbi.ipb-halle.de/MetFamily/.
GNU General Public License v3.0
10 stars 7 forks source link

PCA looks different on current R #74

Closed sneumann closed 8 months ago

sneumann commented 9 months ago

Hi, when switching from R underneath MetFamily from R-3.6.3 to R-4.X, the PCA looks different:

R-3.6.3 gives: PCA363

but R-4.X gives something different: (both 4.0.5 and 4.3.2 give identical result) PCA405

@khabatv identified (thanks!) that the function has switched the default method (SVD, EVD), so it should help to explicitly specify the method in https://github.com/ipb-halle/MetFamily/blob/25c56c44031557c7e75783c9cee198ce74d00b9f/R/Analysis.R#L838

Yours, Steffen

sneumann commented 8 months ago

Hi, to document the progress of catching this, it is not the actual calculation of the PCA that causes the difference. Turns out that already the input data to the PCA in performPca() differs: Default calculation is with https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/Analysis.R#L856

So for 3.6.3 we get

> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; summary(t(dataFrame2[1,]))
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-132.66 -132.42  -34.24    0.00   89.57  235.44 
     TRI03        
 Min.   : -49.16,  1st Qu.: 183.76  ,  Median : 249.74  ,  Mean   : 350.71  ,  3rd Qu.: 379.34  ,  Max.   :2156.67 

And for 4.3.2 we get for the same input

> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; t(summary(t(dataFrame2[1,])))
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-19.075 -18.563  -6.375   0.000  11.204  37.388                                                                              
    TRI03 
Min.   :-30.62   1st Qu.: 12.71   Median : 32.24   Mean   : 23.58  ,  3rd Qu.: 37.67   Max.   : 55.70  

And already the input data to processing is different: https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/Analysis.R#L732

3.6.3:

load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  314.0   352.2 15674.1 21016.8 34995.0 57758.7 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12764   39359   64776  257063  160314 5032056 

4.3.2:

> load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  435.0   446.8   726.5   872.8  1130.0  1731.0 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0   752.5  1357.0  1235.9  1718.5  2116.0 

The data is originally loaded into the dataList$dataFrameMeasurements and differs already there: 3.6.3

> load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]); 
[1] 2585   13
     85.005 /   7.42
 Min.   :2930        1st Qu.:3333        Median :4718        Mean   :5263        3rd Qu.:6252        Max.   :9578          Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 

4.3.2

load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]); 
[1] 2585   13
     85.005 /   7.42
 Min.   : 411.0      1st Qu.: 529.8      Median : 843.0      Mean   : 992.5      3rd Qu.:1478.8      Max.   :1742.0     
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   357.0   915.0   945.3  1501.0  2116.0 

The difference is already present in processMS1data() https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/DataProcessing.R#L706 where the input parameter metaboliteProfile is identical, but the resulting matrixDataFrame() differs: https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/DataProcessing.R#L804

It seems we are getting (vastly!) different results depending on how we convert from string input to numbers:

summary(as.numeric(metaboliteProfile[, "TRI03"]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
Browse[2]> summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE]))
     TRI03       
 Min.   :   1.0  
 1st Qu.: 357.0  
 Median : 915.0  
 Mean   : 945.3  
 3rd Qu.:1501.0  
 Max.   :2116.0  
sneumann commented 8 months ago

So, starting a new comment, as we don't need the code in the MetFamily shiny app anymore, and not even the MetFamily container. The issue is already in the rocker/shiny base containers. There is different behaviour between the R versions (and shiny dependency packages installed) in 3.6.3 and 4.3.2.

The minimum reproducible example needs the attached Rdata file (do not uncompress! The .gz ending is only to make GitHub happy !) in the current directory metaboliteProfileOrig.Rdata.gz

And run it in the shiny containers with different base versions:

sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:3.6.3 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'

R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
     TRI03        
 Min.   :      0  
 1st Qu.:   1630  
 Median :   3342  
 Mean   :  27685  
 3rd Qu.:   8470  
 Max.   :5032056  
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3
> 

And the broken 4.3.2:

sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:4.3.2 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
     TRI03       
 Min.   :   1.0  
 1st Qu.: 357.0  
 Median : 915.0  
 Mean   : 945.3  
 3rd Qu.:1501.0  
 Max.   :2116.0  
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.3.2
sneumann commented 8 months ago

So, I pinpointed the changed bevaviour to

https://cran.r-project.org/bin/windows/base/old/4.0.0/NEWS.R-4.0.0.html

NEW FEATURES
[...]
data.matrix() now converts character columns to factors and from this to integers.

=> Now we need to fix MetFamily, occurrences of data.matrix() seem to be limited to ./R/DataProcessing.R