Closed herrivas closed 3 years ago
Thank you.
I would need a reproducible example, since the examples in ?pcaCoDa works and I never have seen this problem.
Best Matthias
Hello,
I thought that I'd share here how I get ggplots of robust compositional biplots from the output of pcaCoDa, in the hope that it is useful for this package's users and developers.
I prefer to sometimes use ggfortify's autoplot.pca_common rather than biplot.pcaCoDa as the resulting biplots are easier to read. This is often the case when there are many loading arrows to plot.
I think that @herrivas encounters this error as ggplot2 requires the scale and center values returned from prcomp to have the same length as the number of columns in the transformed data. This does not appear to be a requirement of biplot.pcaCoDa. As has been observed, scale and center returned from princompOutputClr in the pcaCoDa object still have lengths of D-1 from the ilr transformation of the data. To get a ggplot-based biplot, I convert these outputs to CLR D-dimensional space - I hope this is mathematically correct!
Here is an example script demonstrating how I do this:
library(ggplot2) #to plot the biplot as a ggplot
library(ggfortify) # to provide biplot options to ggplot2
library(robCompositions)
data("arcticLake")
AL.pca <- pcaCoDa(arcticLake)
#A biplot.pcaCoDa plot to compare with the ggplot2 output. This is Figure 1
biplot(AL.pca, xlim=c(-0.3,0.6))
autoplot(AL.pca)
#Error: Objects of type pcaCoDa not supported by autoplot.
#As the above errors, run autoplot directly on the princompOutputClr item
autoplot(AL.pca$princompOutputClr)
#Error in scale.default(data, center = FALSE, scale = 1/scale) :
#length of 'scale' must equal the number of columns of 'x'
#save princompOutputClr item to variable
dat <- AL.pca$princompOutputClr
#These lengths being shorter than the original dataset dimensions cause the scale.default error given above
length(dat$center)
#D-1 dimensions = 2
length(dat$scale)
#D-1 dimensions = 2
#generate CLR conversion matrix V as per pcaCoDa source code
V <- matrix(0, nrow = ncol(arcticLake), ncol = ncol(arcticLake) - 1)
for (i in 1:ncol(V)) {
V[1:i, i] <- 1/i
V[i + 1, i] <- (-1)
V[, i] <- V[, i] * sqrt(i/(i + 1))
}
#convert center and scale to CLR basis
sc <- V %*% dat$scale
cen <- V %*% dat$center
#overwrites scale and center with CLR versions as vectors
dat$scale <- c(sc)
dat$center <- c(cen)
#try autoplot which calls ggfortify::ggbiplot in the background. This is Figure 2.
p <- autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
For reference, this is my SessionInfo:
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8
[6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] robCompositions_2.3.0 data.table_1.14.0 pls_2.7-3 ggfortify_0.4.11 ggplot2_3.3.5
loaded via a namespace (and not attached):
[1] bitops_1.0-7 RColorBrewer_1.1-2 prabclus_2.3-2 tools_4.1.0 utf8_1.2.1 R6_2.5.0
[7] KernSmooth_2.23-20 DBI_1.1.1 colorspace_2.0-2 nnet_7.3-16 NADA_1.6-1.1 withr_2.4.2
[13] sp_1.4-5 tidyselect_1.1.1 gridExtra_2.3 GGally_2.1.2 curl_4.3.2 compiler_4.1.0
[19] sROC_0.1-2 labeling_0.4.2 diptest_0.76-0 scales_1.1.1 lmtest_0.9-38 DEoptimR_1.0-9
[25] mvtnorm_1.1-2 robustbase_0.93-8 proxy_0.4-26 digest_0.6.27 stringr_1.4.0 foreign_0.8-81
[31] rainbow_3.6 rio_0.5.27 rrcov_1.5-5 pkgconfig_2.0.3 rlang_0.4.11 readxl_1.3.1
[37] zCompositions_1.3.4 farver_2.1.0 generics_0.1.0 zoo_1.8-9 mclust_5.4.7 dplyr_1.0.7
[43] zip_2.2.0 car_3.0-11 RCurl_1.98-1.3 magrittr_2.0.1 modeltools_0.2-23 Matrix_1.3-4
[49] Rcpp_1.0.6 munsell_0.5.0 fansi_0.5.0 abind_1.4-5 lifecycle_1.0.0 stringi_1.6.2
[55] cvTools_0.3.2 carData_3.0-4 MASS_7.3-54 flexmix_2.3-17 plyr_1.8.6 grid_4.1.0
[61] ggrepel_0.9.1 parallel_4.1.0 forcats_0.5.1 crayon_1.4.1 lattice_0.20-44 haven_2.4.1
[67] splines_4.1.0 hms_1.1.0 pillar_1.6.1 ranger_0.12.1 boot_1.3-28 fpc_2.2-9
[73] fda_5.1.9 stats4_4.1.0 glue_1.4.2 laeken_0.5.1 vcd_1.4-8 vctrs_0.3.8
[79] VIM_6.1.0 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.4 tidyr_1.1.3 reshape_0.8.8
[85] kernlab_0.9-29 assertthat_0.2.1 ks_1.13.1 openxlsx_4.2.4 fds_1.8 pracma_2.3.3
[91] e1071_1.7-7 survival_3.2-11 class_7.3-19 pcaPP_1.9-74 truncnorm_1.0-8 tibble_3.1.2
[97] cluster_2.1.2 ellipsis_0.3.2 hdrcde_3.4
I hope this helps! Thank you for reading my post.
Figure 1: biplot.pcaCoDa output:
Figure 2: autoplot output:
Sorry to come back so late. It definitely a nice change to autoplot and ggfortify. Will check this hopefully this week, otherwise early next week. It looks fine at the first sight, but I need time to check carefully.
First, thanks for the creative work! The original authors (Karel and Peter) had in mind to transform the clr coordinates to ilr coordinates to be able to use robust estimators, which would be neglected in your version).
I think the main task would be rather to compare prcomp's biplot with autoplot.prcomp with care.
biplot(prcomp(expenditures))
autoplot(prcomp(expenditures), loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
Only when knowing that ggfortify/autoplot does the right thing (for which I now have my doubts), it would be good to use it in robCompositions.
They differ, look at the loadings.
Because a simple solution for pcaCoDa should be simple to use
library(ggplot2) #to plot the biplot as a ggplot
library(ggfortify) # to provide biplot options to ggplot2
library(robCompositions)
data("arcticLake")
pc <- pcaCoDa(expenditures)
dat <- pc$princompOutputClr
dat$scale <- rep(1,5)
dat$center <- c(dat$center, 1)
autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
biplot(dat)
# they differ a lot, should not be the case when plotting the first 2 components.
pc <- pcaCoDa(expenditures)
dat <- pc$princompOutputClr
dat$scale <- rep(1,5)
dat$center <- c(dat$center, -10)
autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T)
biplot(dat) # same as above, correct: no influence of the last center value!
autoplot(dat, loadings=T, loadings.label = T, loadings.label.repel=T, label=T, label.repel=T, x=3, y=4)
biplot(dat, choices = 3:4)
I included a version in the github version, see the ?plot.pca exercises.
I am trying to plot the result of robust PCA analysis of geochemical data with ggbiplot, but I receive the following error:
" Error in scale.default(data, center = FALSE, scale = 1/scale) : length of 'scale' must equal the number of columns of 'x' "
when analyzing the str() of the pcaCoDa.obj it still has one dimension less than the original data (D-1). Could be possible that this data is still in the ilr space and has not been transformed to the clr space (sensu Filzmoser et al., 2009)? When checking the line 177 of the pcaCoDa code, the creation of the variable pcaClr takes the same values of pcaIlr (pcaClr <- pcaIlr) without any transformation.
Could you help me with this issue?