h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.85k stars 1.99k forks source link

h2o.predict() has a systematic bug for Possion loss #10361

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Following script shows this problem:

----------------------------------------------------------------------

Functions to generate data from multivariate Poisson distribution:

----------------------------------------------------------------------

rMVpois = function(n, A, sigmaY){ p = ncol(A) if(!is.matrix(sigmaY)){sigmaY = diag(sigmaY)} q = nrow(sigmaY) if(p != q){cat("Inputs A and sigmaY are not compatible.")} if(p == q){ indep.pois = matrix(rpois(n = np, lambda = diag(sigmaY)), ncol = n) return(A%%indep.pois) } } is.wholenumber <-function(x, tol = .Machine$double.eps^0.5){abs(x - round(x)) < tol} make.A = function(p, q){ A = NULL if(!is.wholenumber(p) | !is.wholenumber(q)){cat("ERROR: p and q must be non-negative integers.\n")} if(p <2){cat("ERROR: p must be greater than 1.\nTry ?rpois() if p = 1.")} if(p >=2 & is.wholenumber(p) & is.wholenumber(q)){ A.diag = diag(1, p) q.test = c(0, 1)

Independent:

if(q == q.test[1]){A = A.diag}
# Constant Covariance:
if(q == q.test[2]){A = cbind(A.diag, matrix(1, nrow=p, ncol=1))}
if(p == 2 & !(q%in%q.test)){cat("ERROR: q must be either 0 or 1 when p = 2.")}
if(p >2){
  q2.test = c(choose(p,2), choose(p,2) + choose(p,3))
  # Two Way Covariance:
  if(q %in% q2.test[1:2]){
    A.two = matrix(0,ncol=q2.test[1],nrow=p)
    comb.mat = combn(p,2)
    for(pp in 1:p){A.two[pp,] = apply(comb.mat,2, function(x) ifelse(any(x == pp),1,0))}
    A = cbind(A.diag, A.two)
  }
  # Three Way Covariance:
  if(q == q2.test[2]){
    A.three = matrix(0,ncol=q2.test[2] - q2.test[1],nrow=p)
    comb.mat = combn(p,3)
    for(pp in 1:p){A.three[pp,] = apply(comb.mat,2, function(x) ifelse(any(x == pp),1,0))}
    A = cbind(A, A.three)
  }
  if( !(q %in% c(0,1,q2.test))){ cat("ERROR: q must be either 0, 1,", q2.test[1],", or", q2.test[2],"when p =", p,"\n")}
}
if(!is.null(A)){return(A)}

} }

----------------------------------------------------------------------

--------------------------------------------

Generate sample multivariate Possion data:

--------------------------------------------

set.seed(12345) sampleData = t(rMVpois(n = 1000, A = make.A(p=5,q=10), sigmaY = c(rep(20,5),1:10)))

sample structure:

cov(sampleData)

population structure:

make.A(p=5,q=10)%%diag(c(rep(20,5),rep(5,10)))%%t(make.A(p=5,q=10))

-------------------------------------

Run H20 analysis

-------------------------------------

library(h2o) h2o.init()

Create PCA full rank solution using log of data:

pca.hex <- as.h2o(log(sampleData), destination_frame="pca.hex") pca.glrm <- h2o.glrm(training_frame = pca.hex, k = 5, loss = "Quadratic", transform = "NONE", regularization_x = "None", regularization_y = "None", max_iterations = 1000, init = "SVD")

Set initial solution to pca solution:

pca.y <- as.matrix(pca.glrm@model$archetypes) pca.x <- as.matrix(h2o.getFrame(pca.glrm@model$representation_name))

Using full rank Poisson loss should equal full rank PCA solution:

k = 5 mvp.hex <- as.h2o(sampleData, destination_frame="mvp.hex") mvp.glrm <- h2o.glrm(training_frame = mvp.hex, k = k, loss = "Poisson", transform = "NONE", regularization_x = "None", regularization_y = "None",
init = "User", user_y = pca.y[1:k,], user_x = pca.x[,1:k], max_iterations = 1000)

Extract components:

mvp.y <- as.matrix(mvp.glrm@model$archetypes) mvp.x <- as.matrix(h2o.getFrame(mvp.glrm@model$representation_name))

Manual prediction (should equal true values since using full rank)

mvp.xy <- exp(mvp.x%*%mvp.y)

Prediction using h2o.predict (should equal true values since full rank)

mvp.pred <- as.matrix(h2o.predict(mvp.glrm, mvp.hex))

h2o.predict is systematically off by 1 for poisson loss

rbind(truth = sampleData[,1], h2o.predict = mvp.pred[,1],manual = mvp.xy[,1])

Likely there is a hard coded problem

I suspect it was implemented for issues that occur when data == 0

Fix: adjust h2o.predict() for Poisson loss

Alternatively Fix: Zero problem likely deals with 0*log(0), which can be corrected in

the limit using something similar to ifelse(a == 0, 1, a*log(a))

h2o.shutdown(prompt = FALSE)

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3449 Assignee: New H2O Bugs Reporter: Avkash Chauhan State: Open Fix Version: N/A Attachments: N/A Development PRs: N/A