matthias-da / robCompositions

Robust Methods for Compositional Data
11 stars 8 forks source link

Problem with returned dimensions (D-1) #12

Closed herrivas closed 3 years ago

herrivas commented 3 years ago

I am trying to plot the result of robust PCA analysis of geochemical data with ggbiplot, but I receive the following error:

" Error in scale.default(data, center = FALSE, scale = 1/scale) : length of 'scale' must equal the number of columns of 'x' "

when analyzing the str() of the pcaCoDa.obj it still has one dimension less than the original data (D-1). Could be possible that this data is still in the ilr space and has not been transformed to the clr space (sensu Filzmoser et al., 2009)? When checking the line 177 of the pcaCoDa code, the creation of the variable pcaClr takes the same values of pcaIlr (pcaClr <- pcaIlr) without any transformation.

Could you help me with this issue?

matthias-da commented 3 years ago

Thank you.

I would need a reproducible example, since the examples in ?pcaCoDa works and I never have seen this problem.

Best Matthias

DrewGoodfellow commented 3 years ago

Hello,

I thought that I'd share here how I get ggplots of robust compositional biplots from the output of pcaCoDa, in the hope that it is useful for this package's users and developers.

I prefer to sometimes use ggfortify's autoplot.pca_common rather than biplot.pcaCoDa as the resulting biplots are easier to read. This is often the case when there are many loading arrows to plot.

I think that @herrivas encounters this error as ggplot2 requires the scale and center values returned from prcomp to have the same length as the number of columns in the transformed data. This does not appear to be a requirement of biplot.pcaCoDa. As has been observed, scale and center returned from princompOutputClr in the pcaCoDa object still have lengths of D-1 from the ilr transformation of the data. To get a ggplot-based biplot, I convert these outputs to CLR D-dimensional space - I hope this is mathematically correct!

Here is an example script demonstrating how I do this:

library(ggplot2) #to plot the biplot as a ggplot
library(ggfortify) # to provide biplot options to ggplot2 
library(robCompositions)

data("arcticLake")

AL.pca <-  pcaCoDa(arcticLake)

#A biplot.pcaCoDa plot to compare with the  ggplot2 output. This is Figure 1

biplot(AL.pca, xlim=c(-0.3,0.6))

autoplot(AL.pca)

#Error: Objects of type pcaCoDa not supported by autoplot.

#As the above errors, run autoplot directly on the princompOutputClr item

autoplot(AL.pca$princompOutputClr)

#Error in scale.default(data, center = FALSE, scale = 1/scale) : 
#length of 'scale' must equal the number of columns of 'x'

#save princompOutputClr item to variable

dat <-  AL.pca$princompOutputClr

#These lengths being shorter than the original dataset dimensions cause the scale.default error given above

length(dat$center)
 #D-1 dimensions = 2
length(dat$scale)
#D-1 dimensions = 2

#generate  CLR  conversion matrix V as per pcaCoDa source code

V <- matrix(0, nrow = ncol(arcticLake), ncol = ncol(arcticLake) - 1)
for (i in 1:ncol(V)) {
  V[1:i, i] <- 1/i
  V[i + 1, i] <- (-1)
  V[, i] <- V[, i] * sqrt(i/(i + 1))
}

#convert center and scale to CLR basis
sc <-  V %*% dat$scale
cen <- V %*% dat$center

#overwrites scale and center with CLR versions as vectors

dat$scale <-  c(sc)

dat$center <-  c(cen)

#try autoplot which calls ggfortify::ggbiplot in the background. This is Figure 2.

p <- autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)

For reference, this is my SessionInfo:

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] robCompositions_2.3.0 data.table_1.14.0     pls_2.7-3             ggfortify_0.4.11      ggplot2_3.3.5        

loaded via a namespace (and not attached):
 [1] bitops_1.0-7        RColorBrewer_1.1-2  prabclus_2.3-2      tools_4.1.0         utf8_1.2.1          R6_2.5.0           
 [7] KernSmooth_2.23-20  DBI_1.1.1           colorspace_2.0-2    nnet_7.3-16         NADA_1.6-1.1        withr_2.4.2        
[13] sp_1.4-5            tidyselect_1.1.1    gridExtra_2.3       GGally_2.1.2        curl_4.3.2          compiler_4.1.0     
[19] sROC_0.1-2          labeling_0.4.2      diptest_0.76-0      scales_1.1.1        lmtest_0.9-38       DEoptimR_1.0-9     
[25] mvtnorm_1.1-2       robustbase_0.93-8   proxy_0.4-26        digest_0.6.27       stringr_1.4.0       foreign_0.8-81     
[31] rainbow_3.6         rio_0.5.27          rrcov_1.5-5         pkgconfig_2.0.3     rlang_0.4.11        readxl_1.3.1       
[37] zCompositions_1.3.4 farver_2.1.0        generics_0.1.0      zoo_1.8-9           mclust_5.4.7        dplyr_1.0.7        
[43] zip_2.2.0           car_3.0-11          RCurl_1.98-1.3      magrittr_2.0.1      modeltools_0.2-23   Matrix_1.3-4       
[49] Rcpp_1.0.6          munsell_0.5.0       fansi_0.5.0         abind_1.4-5         lifecycle_1.0.0     stringi_1.6.2      
[55] cvTools_0.3.2       carData_3.0-4       MASS_7.3-54         flexmix_2.3-17      plyr_1.8.6          grid_4.1.0         
[61] ggrepel_0.9.1       parallel_4.1.0      forcats_0.5.1       crayon_1.4.1        lattice_0.20-44     haven_2.4.1        
[67] splines_4.1.0       hms_1.1.0           pillar_1.6.1        ranger_0.12.1       boot_1.3-28         fpc_2.2-9          
[73] fda_5.1.9           stats4_4.1.0        glue_1.4.2          laeken_0.5.1        vcd_1.4-8           vctrs_0.3.8        
[79] VIM_6.1.0           cellranger_1.1.0    gtable_0.3.0        purrr_0.3.4         tidyr_1.1.3         reshape_0.8.8      
[85] kernlab_0.9-29      assertthat_0.2.1    ks_1.13.1           openxlsx_4.2.4      fds_1.8             pracma_2.3.3       
[91] e1071_1.7-7         survival_3.2-11     class_7.3-19        pcaPP_1.9-74        truncnorm_1.0-8     tibble_3.1.2       
[97] cluster_2.1.2       ellipsis_0.3.2      hdrcde_3.4         

I hope this helps! Thank you for reading my post.

Figure 1: biplot.pcaCoDa output: Fig1

Figure 2: autoplot output: Fig2

matthias-da commented 3 years ago

Sorry to come back so late. It definitely a nice change to autoplot and ggfortify. Will check this hopefully this week, otherwise early next week. It looks fine at the first sight, but I need time to check carefully.

matthias-da commented 3 years ago

First, thanks for the creative work! The original authors (Karel and Peter) had in mind to transform the clr coordinates to ilr coordinates to be able to use robust estimators, which would be neglected in your version).

I think the main task would be rather to compare prcomp's biplot with autoplot.prcomp with care.

biplot(prcomp(expenditures))
autoplot(prcomp(expenditures), loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)

Only when knowing that ggfortify/autoplot does the right thing (for which I now have my doubts), it would be good to use it in robCompositions.

They differ, look at the loadings.

Because a simple solution for pcaCoDa should be simple to use

library(ggplot2) #to plot the biplot as a ggplot
library(ggfortify) # to provide biplot options to ggplot2 
library(robCompositions)

data("arcticLake")

pc <-  pcaCoDa(expenditures)
dat <-  pc$princompOutputClr
dat$scale <- rep(1,5)
dat$center <- c(dat$center, 1)
autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
biplot(dat)
# they differ a lot, should not be the case when plotting the first 2 components.

pc <-  pcaCoDa(expenditures)
dat <-  pc$princompOutputClr
dat$scale <- rep(1,5)
dat$center <- c(dat$center, -10)
autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
biplot(dat) # same as above, correct: no influence of the last center value!

autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T, x=3, y=4)
biplot(dat, choices = 3:4)
matthias-da commented 3 years ago

I included a version in the github version, see the ?plot.pca exercises.