H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Following script shows this problem:
----------------------------------------------------------------------
Functions to generate data from multivariate Poisson distribution:
----------------------------------------------------------------------
rMVpois = function(n, A, sigmaY){ p = ncol(A) if(!is.matrix(sigmaY)){sigmaY = diag(sigmaY)} q = nrow(sigmaY) if(p != q){cat("Inputs A and sigmaY are not compatible.")} if(p == q){ indep.pois = matrix(rpois(n = np, lambda = diag(sigmaY)), ncol = n) return(A%%indep.pois) } } is.wholenumber <-function(x, tol = .Machine$double.eps^0.5){abs(x - round(x)) < tol} make.A = function(p, q){ A = NULL if(!is.wholenumber(p) | !is.wholenumber(q)){cat("ERROR: p and q must be non-negative integers.\n")} if(p <2){cat("ERROR: p must be greater than 1.\nTry ?rpois() if p = 1.")} if(p >=2 & is.wholenumber(p) & is.wholenumber(q)){ A.diag = diag(1, p) q.test = c(0, 1)
Independent:
} }
----------------------------------------------------------------------
--------------------------------------------
Generate sample multivariate Possion data:
--------------------------------------------
set.seed(12345) sampleData = t(rMVpois(n = 1000, A = make.A(p=5,q=10), sigmaY = c(rep(20,5),1:10)))
sample structure:
cov(sampleData)
population structure:
make.A(p=5,q=10)%%diag(c(rep(20,5),rep(5,10)))%%t(make.A(p=5,q=10))
-------------------------------------
Run H20 analysis
-------------------------------------
library(h2o) h2o.init()
Create PCA full rank solution using log of data:
pca.hex <- as.h2o(log(sampleData), destination_frame="pca.hex") pca.glrm <- h2o.glrm(training_frame = pca.hex, k = 5, loss = "Quadratic", transform = "NONE", regularization_x = "None", regularization_y = "None", max_iterations = 1000, init = "SVD")
Set initial solution to pca solution:
pca.y <- as.matrix(pca.glrm@model$archetypes) pca.x <- as.matrix(h2o.getFrame(pca.glrm@model$representation_name))
Using full rank Poisson loss should equal full rank PCA solution:
k = 5 mvp.hex <- as.h2o(sampleData, destination_frame="mvp.hex") mvp.glrm <- h2o.glrm(training_frame = mvp.hex, k = k, loss = "Poisson", transform = "NONE", regularization_x = "None", regularization_y = "None",
init = "User", user_y = pca.y[1:k,], user_x = pca.x[,1:k], max_iterations = 1000)
Extract components:
mvp.y <- as.matrix(mvp.glrm@model$archetypes) mvp.x <- as.matrix(h2o.getFrame(mvp.glrm@model$representation_name))
Manual prediction (should equal true values since using full rank)
mvp.xy <- exp(mvp.x%*%mvp.y)
Prediction using h2o.predict (should equal true values since full rank)
mvp.pred <- as.matrix(h2o.predict(mvp.glrm, mvp.hex))
h2o.predict is systematically off by 1 for poisson loss
rbind(truth = sampleData[,1], h2o.predict = mvp.pred[,1],manual = mvp.xy[,1])
Likely there is a hard coded problem
I suspect it was implemented for issues that occur when data == 0
Fix: adjust h2o.predict() for Poisson loss
Alternatively Fix: Zero problem likely deals with 0*log(0), which can be corrected in
the limit using something similar to ifelse(a == 0, 1, a*log(a))
h2o.shutdown(prompt = FALSE)