Closed sneumann closed 8 months ago
Hi, to document the progress of catching this, it is not the actual calculation of the PCA that causes the difference.
Turns out that already the input data to the PCA in performPca()
differs:
Default calculation is with
https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/Analysis.R#L856
So for 3.6.3 we get
> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; summary(t(dataFrame2[1,]))
[1] 6 219
Min. 1st Qu. Median Mean 3rd Qu. Max.
-132.66 -132.42 -34.24 0.00 89.57 235.44
TRI03
Min. : -49.16, 1st Qu.: 183.76 , Median : 249.74 , Mean : 350.71 , 3rd Qu.: 379.34 , Max. :2156.67
And for 4.3.2 we get for the same input
> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; t(summary(t(dataFrame2[1,])))
[1] 6 219
Min. 1st Qu. Median Mean 3rd Qu. Max.
-19.075 -18.563 -6.375 0.000 11.204 37.388
TRI03
Min. :-30.62 1st Qu.: 12.71 Median : 32.24 Mean : 23.58 , 3rd Qu.: 37.67 Max. : 55.70
And already the input data to processing is different: https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/Analysis.R#L732
3.6.3:
load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1] 6 219
Min. 1st Qu. Median Mean 3rd Qu. Max.
314.0 352.2 15674.1 21016.8 34995.0 57758.7
Min. 1st Qu. Median Mean 3rd Qu. Max.
12764 39359 64776 257063 160314 5032056
4.3.2:
> load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1] 6 219
Min. 1st Qu. Median Mean 3rd Qu. Max.
435.0 446.8 726.5 872.8 1130.0 1731.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 752.5 1357.0 1235.9 1718.5 2116.0
The data is originally loaded into the dataList$dataFrameMeasurements
and differs already there:
3.6.3
> load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]);
[1] 2585 13
85.005 / 7.42
Min. :2930 1st Qu.:3333 Median :4718 Mean :5263 3rd Qu.:6252 Max. :9578 Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1630 3342 27685 8470 5032056
4.3.2
load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]);
[1] 2585 13
85.005 / 7.42
Min. : 411.0 1st Qu.: 529.8 Median : 843.0 Mean : 992.5 3rd Qu.:1478.8 Max. :1742.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 357.0 915.0 945.3 1501.0 2116.0
The difference is already present in processMS1data()
https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/DataProcessing.R#L706
where the input parameter metaboliteProfile
is identical, but the resulting matrixDataFrame()
differs:
https://github.com/ipb-halle/MetFamily/blob/5348757c1b3b01d73cf8024e9bdf7464a3fbdbd4/R/DataProcessing.R#L804
It seems we are getting (vastly!) different results depending on how we convert from string input to numbers:
summary(as.numeric(metaboliteProfile[, "TRI03"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1630 3342 27685 8470 5032056
Browse[2]> summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE]))
TRI03
Min. : 1.0
1st Qu.: 357.0
Median : 915.0
Mean : 945.3
3rd Qu.:1501.0
Max. :2116.0
So, starting a new comment, as we don't need the code in the MetFamily shiny app anymore, and not even the MetFamily container. The issue is already in the rocker/shiny
base containers.
There is different behaviour between the R versions (and shiny dependency packages installed) in 3.6.3 and 4.3.2.
The minimum reproducible example needs the attached Rdata file (do not uncompress! The .gz ending is only to make GitHub happy !) in the current directory metaboliteProfileOrig.Rdata.gz
And run it in the shiny containers with different base versions:
sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:3.6.3 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'
R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1630 3342 27685 8470 5032056
TRI03
Min. : 0
1st Qu.: 1630
Median : 3342
Mean : 27685
3rd Qu.: 8470
Max. :5032056
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3
>
And the broken 4.3.2:
sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:4.3.2 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'
R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 1630 3342 27685 8470 5032056
TRI03
Min. : 1.0
1st Qu.: 357.0
Median : 915.0
Mean : 945.3
3rd Qu.:1501.0
Max. :2116.0
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.3.2
So, I pinpointed the changed bevaviour to
https://cran.r-project.org/bin/windows/base/old/4.0.0/NEWS.R-4.0.0.html
NEW FEATURES
[...]
data.matrix() now converts character columns to factors and from this to integers.
=> Now we need to fix MetFamily, occurrences of data.matrix()
seem to be limited to ./R/DataProcessing.R
Hi, when switching from R underneath MetFamily from R-3.6.3 to R-4.X, the PCA looks different:
R-3.6.3 gives:
but R-4.X gives something different: (both 4.0.5 and 4.3.2 give identical result)
@khabatv identified (thanks!) that the function has switched the default method (SVD, EVD), so it should help to explicitly specify the method in https://github.com/ipb-halle/MetFamily/blob/25c56c44031557c7e75783c9cee198ce74d00b9f/R/Analysis.R#L838
Yours, Steffen