hredestig / pcaMethods

Perform PCA on data with missing values in R

projecting nlpca to new data sets #2

Closed topepo closed 5 years ago

topepo commented 7 years ago

I'd like to estimate an autoencoder from one data set and apply it to another with the same number of variables but with a different number of rows.

> library(pcaMethods)
> 
> set.seed(1)
> in_train <- sample(1:150, 100)
> tr <- iris[ in_train, -5]
> te <- iris[-in_train, -5]
> 
> nlpca_obj <- pca(tr, nPcs=2, method="nlpca", maxSteps=500, verbose = FALSE)
> 
> head(fitted(nlpca_obj, tr))
         [,1]     [,2]     [,3]      [,4]
[1,] 5.050568 3.467380 1.425588 0.2393514
[2,] 5.795947 2.718434 4.372717 1.4699193
[3,] 5.588153 2.669189 4.398656 1.5534895
[4,] 6.368556 2.895570 4.956616 1.6964933
[5,] 4.718083 3.082611 1.499397 0.2413291
[6,] 7.356033 3.224832 5.930248 2.0810760
> 
> fitted(nlpca_obj, te)
Error in .Method(..., deparse.level = deparse.level) : 
  number of columns of matrices must match (see arg 2)
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.4

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pcaMethods_1.66.0   Biobase_2.34.0      BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] tools_3.3.3  Rcpp_0.12.10

I can't think of an analytical reason that this wouldn't work.

Thanks

(related to topepo/recipes#35)

hredestig commented 7 years ago

It has been quite some time since I worked on this, but I agree: I don't see a reason why this shouldn't be possible. The implementation doesn't allow for it because the fitted function is meant only for getting the fitted values for the training data, not for new data. There is also a predict function for new data, but it is not implemented for nonlinear PCA. This could probably be implemented, but as you also note in the thread you reference, nlpca is extremely slow, so I wonder whether this is really the way to go or whether a more complete overhaul would be better. Pull requests are welcome :)
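
For comparison, here is a minimal sketch of the predict route for a linear method, reusing the train/test split from the original report. It assumes method="svd" is an acceptable stand-in for illustration; it does not make nlpca work on new data.

```r
library(pcaMethods)

set.seed(1)
in_train <- sample(1:150, 100)
tr <- iris[ in_train, -5]
te <- iris[-in_train, -5]

# Linear PCA via SVD, used here instead of method = "nlpca"
svd_obj <- pca(tr, nPcs = 2, method = "svd")

# predict() returns a list: `scores` holds the projected test rows,
# `x` holds the test data reconstructed from those scores
pred <- predict(svd_obj, te)
dim(pred$scores)
dim(pred$x)
```

The centering estimated on the training rows is applied to the test rows before projection, so the test scores live in the same space as the training scores.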

gdkrmr commented 6 years ago

I just re-read the corresponding paper and there is a catch: I think nlPCA in pcaMethods implements only the decoder part of an autoencoder and optimizes the representation in the reduced dimensions directly. There is therefore no easy mapping from data space to nl-PCA space, and new points have to be projected by optimization, via gradient descent or a similar method.
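
A rough sketch of what that optimization could look like, in base R only. The decoder here is a hypothetical toy network with fixed random weights (`W1`, `W2`, `decode` and `project_point` are all made-up names), not the network that pcaMethods trains, and it uses `optim` with BFGS rather than plain gradient descent.

```r
# Hypothetical decoder: maps a 2-d latent vector z to a 4-d data vector.
# A real NLPCA fit would supply the trained network here; these weights
# are random and serve only to keep the sketch self-contained.
set.seed(1)
W1 <- matrix(rnorm(3 * 2), nrow = 3)  # latent (2) -> hidden (3)
W2 <- matrix(rnorm(4 * 3), nrow = 4)  # hidden (3) -> data (4)
decode <- function(z) as.vector(W2 %*% tanh(W1 %*% z))

# Project one new observation x by searching for the latent vector whose
# decoded output is closest to x in squared error.
project_point <- function(x, decode, n_latent, start = rep(0, n_latent)) {
  loss <- function(z) sum((x - decode(z))^2)
  fit <- optim(start, loss, method = "BFGS")
  list(scores = fit$par, error = fit$value, converged = fit$convergence == 0)
}

# Usage: decode a known latent point, then try to recover it from the data side
z_true <- c(0.3, -0.2)
x_new <- decode(z_true)
res <- project_point(x_new, decode, n_latent = 2)
print(res)
```

Each new row needs its own optimization run, and because the loss is not convex in general, the result can depend on the starting point. That is part of why this is more involved than a single forward pass through an encoder.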

hredestig commented 5 years ago

Indeed, not straightforward. Closing this one.