h2oai / h2o-3


Add observation weights to quantile computation #15308

Closed: exalate-issue-sync[bot] closed this issue 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Step 1: Confirm that quantiles are correct in comparison to R:

{code}
library(testthat)
library(h2o)
h2o.init()
df <- h2o.createFrame(missing_fraction = 0, seed=1234)
df
for (i in c(1,4,5,6,7,9,10)) {
  h <- h2o.quantile(df[,i], probs=c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
  r <- quantile(x=as.matrix(as.data.frame(df[,i])), probs=c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0))
  expect_equal(r,h)
}
{code}

Passes.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Step 2: Add a weights_column argument to h2o.quantile(). Step 3: Add a test for weighted quantiles.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Question for wtd.quantile in R as used in https://github.com/h2oai/h2o-3/blob/master/h2o-r/tests/testdir_misc/runit_NOPASS_weighted_quantile.R

Does wtd.quantile get the same results for

{code}
fr1 = parse_test_file("smalldata/junit/weights_all_twos.csv");
fr2 = parse_test_file("smalldata/junit/weights_all_ones.csv");
{code}

and for

{code}
fr1 = parse_test_file("smalldata/junit/no_weights.csv");
fr2 = parse_test_file("smalldata/junit/weights.csv");
{code}

I am referring to https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/quantile/QuantileTest.java#L218 and https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/test/java/hex/quantile/QuantileTest.java#L264, which are both failing currently. Trying to understand whether R gets the exact same quantiles for those cases or not.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented:

{code}
library(Hmisc)  # provides wtd.quantile
setwd("~")
nw <- read.csv("h2o-3/smalldata/junit/no_weights.csv")
w <- read.csv("h2o-3/smalldata/junit/weights.csv")
w1 <- read.csv("h2o-3/smalldata/junit/weights_all_ones.csv")
w2 <- read.csv("h2o-3/smalldata/junit/weights_all_twos.csv")
# unweighted baseline
quantile(nw$f1, probs = seq(0,1,.05))
# weighted, with and without weight normalization
wtd.quantile(w$f1, w$weight, probs = seq(0,1,.05))
wtd.quantile(w$f1, w$weight, probs = seq(0,1,.05), normwt = TRUE)
# all weights 1 vs. all weights 2
wtd.quantile(w1$f1, w1$weight, probs = seq(0,1,.05))
wtd.quantile(w2$f1, w2$weight, probs = seq(0,1,.05))
wtd.quantile(w2$f1, w2$weight, probs = seq(0,1,.05), normwt = TRUE)
{code}

This shows that even wtd.quantile returns different quantiles when all weights are 1 than when all weights are 2.
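One plausible reading of that difference, assuming unnormalized weights act as replication counts: a weight of 2 on every row is then the same as listing every row twice, and R's default (type 7) interpolated quantiles are not invariant under that. The sketch below uses only base R and a made-up vector `x`; it does not call wtd.quantile or H2O and does not use the junit CSV files.

```r
# Hypothetical toy data, not taken from the smalldata/junit files.
x <- c(1, 2, 3, 4, 5)
probs <- c(0, 0.1, 0.5, 0.9, 1)

# Every row counted once (like all weights = 1).
q_once <- unname(quantile(x, probs = probs))

# Every row counted twice (like all weights = 2 read as replication counts).
q_twice <- unname(quantile(rep(x, each = 2), probs = probs))

# Side-by-side comparison; the interior quantiles need not agree.
print(data.frame(probs = probs, once = q_once, twice = q_twice))
```

If the weights are rescaled to sum to the number of rows before the quantiles are computed, all-ones and all-twos weights become identical inputs, which is one way to make the two cases agree.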

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: h3. Simple test for weights < 1; results are not quite right as of dbef536dc48fa

{code}
library(Hmisc)  # provides wtd.quantile
probs <- c(0,0.25,0.5,0.75,1)

# Reference results in plain R
x <- c(1,2,3,4,5)
w <- c(0.5,0.4,0.3,0.2,0.1)
wtd.quantile(x,weights=w,probs=probs)
wtd.quantile(x,weights=w,normwt = TRUE,probs=probs)
wtd.quantile(x, 5:1,probs=probs)
# y is x with each value repeated 5:1 times
y <- c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5)
quantile(y,probs=probs)

# Same cases in H2O: fractional weights
library(h2o)
h2o.init()
x <- c(1,2,3,4,5)
w <- c(0.5,0.4,0.3,0.2,0.1)
df <- as.h2o(x)
df$w <- as.h2o(w)
h2o.quantile(df,weights_column="w",probs=probs)

# Weights rescaled to sum to the number of rows
df$w <- df$w*nrow(df)/sum(df$w)
h2o.quantile(df,weights_column="w",probs=probs)

# Integer weights 5:1
df <- as.h2o(x)
df$w <- as.h2o(5:1)
h2o.quantile(df,weights_column="w",probs=probs)

# Unweighted quantiles of the replicated data
y <- c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5)
df <- as.h2o(y)
h2o.quantile(df, probs=probs)
{code}
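The comparison between `wtd.quantile(x, 5:1, ...)` and `quantile(y, ...)` above rests on the idea that integer weights can be read as replication counts. A minimal base-R sketch of that idea follows. The function name `weighted_quantile_sketch` is hypothetical, it only handles positive integer weights, and it is not the code used by wtd.quantile or by h2o.quantile.

```r
# Sketch: weighted quantile with weights read as replication counts,
# using the same linear interpolation as quantile(..., type = 7).
# Assumes positive integer weights; fractional weights are out of scope.
weighted_quantile_sketch <- function(x, w, probs) {
  stopifnot(length(x) == length(w), all(w > 0), all(w == round(w)))
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cw <- cumsum(w)            # last position occupied by each value of x
  n <- sum(w)                # size of the implied replicated sample
  h <- 1 + (n - 1) * probs   # fractional position in the replicated sample
  lo <- pmax(floor(h), 1)
  hi <- pmin(lo + 1, n)
  frac <- h - floor(h)
  # Value found at integer position k of the implied replicated sample.
  value_at <- function(k) x[findInterval(k, cw, left.open = TRUE) + 1]
  (1 - frac) * value_at(lo) + frac * value_at(hi)
}

probs <- c(0, 0.25, 0.5, 0.75, 1)
x <- c(1, 2, 3, 4, 5)
y <- rep(x, times = 5:1)     # same y as in the snippet above

sketch_q <- weighted_quantile_sketch(x, 5:1, probs)
plain_q <- unname(quantile(y, probs = probs))
print(rbind(sketch = sketch_q, replicated = plain_q))
```

With weights below 1 the implied sample size `n` can fall below 1 and the position formula stops making sense, which is why the sketch rejects such input rather than guessing at a convention.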

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2402
Assignee: Arno Candel
Reporter: Arno Candel
State: Resolved
Fix Version: N/A
Attachments: N/A
Development PRs: N/A