h2oai / h2o-tutorials

Tutorials and training material for the H2O Machine Learning Platform
http://h2o.ai

h2o.predict issue related to !!!StackedEnsemble!!! h2o version: 3.23.0.4602 #107

Closed turgut090 closed 5 years ago

turgut090 commented 5 years ago
set.seed(23)
aml <- h2o.automl(y = outcome, x = features,
                  training_frame = train,
                  # validation_frame = test,
                  leaderboard_frame = test,
                  seed = 3,
                  # max_runtime_secs = 120,
                  # exclude_algos = c("StackedEnsemble"),
                  max_models = 2)

testing <- read_csv('../input/test.csv') %>% select(-ID_code)

light_test2 <- h2o.predict(aml@leader, as.h2o(testing)) %>% as.data.frame() %>% .$p1

H2O VERSION 3.23.0.4602

Loading required package: lubridate

Loading required package: methods

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

Loading required package: PerformanceAnalytics

Loading required package: xts

Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Attaching package: ‘PerformanceAnalytics’

The following object is masked from ‘package:graphics’:

    legend

Loading required package: quantmod

Loading required package: TTR

Version 0.4-0 included new data defaults. See ?getSymbols.

Loading required package: tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.1.0.9000     ✔ purrr   0.3.1     
✔ tibble  2.0.1          ✔ dplyr   0.8.0.1   
✔ tidyr   0.8.3          ✔ stringr 1.4.0     
✔ readr   1.3.1          ✔ forcats 0.4.0     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ dplyr::first()           masks xts::first()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ dplyr::last()            masks xts::last()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()

Loading required package: R6

Attaching package: ‘lightgbm’

The following object is masked from ‘package:dplyr’:

    slice

Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

Parsed with column specification:
cols(
  .default = col_double(),
  ID_code = col_character()
)
See spec(...) for full column specifications.

Loading required package: lattice

Attaching package: ‘caret’

The following object is masked from ‘package:purrr’:

    lift

Warning message:
replacing previous import ‘ggplot2::empty’ by ‘plyr::empty’ when loading ‘caret’ 

numeric(0)

# A tibble: 2 x 15
  Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 Fold11
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1  0.896  0.892  0.899  0.900  0.900  0.900 0.903   0.898 0.901  0.906   0.895
2  0.104  0.108  0.101  0.100  0.100  0.100 0.0972  0.102 0.0991 0.0943  0.105
# … with 4 more variables: Fold12 <dbl>, Fold13 <dbl>, Fold14 <dbl>,
#   Fold15 <dbl>

Loading required package: mlr

Loading required package: ParamHelpers

Attaching package: ‘ParamHelpers’

The following object is masked from ‘package:quantmod’:

    getDefaults

Attaching package: ‘mlr’

The following object is masked from ‘package:caret’:

    train

Loading required package: foreach

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: doParallel

Loading required package: iterators

Loading required package: parallel

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------

Attaching package: ‘h2o’

The following object is masked from ‘package:pROC’:

    var

The following objects are masked from ‘package:lubridate’:

    day, hour, month, week, year

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpvEQjuP/h2o_UnknownUser_started_from_r.out
    /tmp/RtmpvEQjuP/h2o_UnknownUser_started_from_r.err

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-1~deb9u1-b15)

OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)

Starting H2O JVM and connecting: 
.
.
 Connection successful!

R is connected to the H2O cluster: 

    H2O cluster uptime:         2 seconds 222 milliseconds 
    H2O cluster timezone:       Etc/UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.23.0.4602 
    H2O cluster version age:    14 days, 4 hours and 48 minutes  
    H2O cluster name:           H2O_started_from_R_root_zqv228 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   5.67 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.4.2 (2017-09-28) 

  |======================================================================| 100%

[1] 41865   201

  |======================================================================| 100%

                                             model_id       auc   logloss
1 StackedEnsemble_BestOfFamily_AutoML_20190325_112022 0.8197364 0.5185754
2    StackedEnsemble_AllModels_AutoML_20190325_112022 0.8197364 0.5185754
3                        XRT_1_AutoML_20190325_112022 0.8041693 0.5892604
4                        DRF_1_AutoML_20190325_112022 0.7969176 0.5914487
  mean_per_class_error      rmse       mse
1            0.2743024 0.4163262 0.1733275
2            0.2743024 0.4163262 0.1733275
3            0.2817554 0.4473749 0.2001443
4            0.2936830 0.4486491 0.2012860
$model_id

[1] "StackedEnsemble_BestOfFamily_AutoML_20190325_112022"

$training_frame
[1] "automl_training_RTMP_sid_92e1_3"

$base_models
$base_models[[1]]
$base_models[[1]]$`__meta`
$base_models[[1]]$`__meta`$schema_version
[1] 3

$base_models[[1]]$`__meta`$schema_name
[1] "ModelKeyV3"

$base_models[[1]]$`__meta`$schema_type
[1] "Key<Model>"

$base_models[[1]]$name
[1] "XRT_1_AutoML_20190325_112022"

$base_models[[1]]$type
[1] "Key<Model>"

$base_models[[1]]$URL
[1] "/3/Models/XRT_1_AutoML_20190325_112022"

$base_models[[2]]
$base_models[[2]]$`__meta`
$base_models[[2]]$`__meta`$schema_version
[1] 3

$base_models[[2]]$`__meta`$schema_name
[1] "ModelKeyV3"

$base_models[[2]]$`__meta`$schema_type
[1] "Key<Model>"

$base_models[[2]]$name
[1] "DRF_1_AutoML_20190325_112022"

$base_models[[2]]$type
[1] "Key<Model>"

$base_models[[2]]$URL
[1] "/3/Models/DRF_1_AutoML_20190325_112022"

$metalearner_nfolds
[1] 5

$seed
[1] 3

$keep_levelone_frame
[1] TRUE

$x
  [1] "var_0"   "var_1"   "var_2"   "var_3"   "var_4"   "var_5"   "var_6"  
  [8] "var_7"   "var_8"   "var_9"   "var_10"  "var_11"  "var_12"  "var_13" 
 [15] "var_14"  "var_15"  "var_16"  "var_17"  "var_18"  "var_19"  "var_20" 
 [22] "var_21"  "var_22"  "var_23"  "var_24"  "var_25"  "var_26"  "var_27" 
 [29] "var_28"  "var_29"  "var_30"  "var_31"  "var_32"  "var_33"  "var_34" 
 [36] "var_35"  "var_36"  "var_37"  "var_38"  "var_39"  "var_40"  "var_41" 
 [43] "var_42"  "var_43"  "var_44"  "var_45"  "var_46"  "var_47"  "var_48" 
 [50] "var_49"  "var_50"  "var_51"  "var_52"  "var_53"  "var_54"  "var_55" 
 [57] "var_56"  "var_57"  "var_58"  "var_59"  "var_60"  "var_61"  "var_62" 
 [64] "var_63"  "var_64"  "var_65"  "var_66"  "var_67"  "var_68"  "var_69" 
 [71] "var_70"  "var_71"  "var_72"  "var_73"  "var_74"  "var_75"  "var_76" 
 [78] "var_77"  "var_78"  "var_79"  "var_80"  "var_81"  "var_82"  "var_83" 
 [85] "var_84"  "var_85"  "var_86"  "var_87"  "var_88"  "var_89"  "var_90" 
 [92] "var_91"  "var_92"  "var_93"  "var_94"  "var_95"  "var_96"  "var_97" 
 [99] "var_98"  "var_99"  "var_100" "var_101" "var_102" "var_103" "var_104"
[106] "var_105" "var_106" "var_107" "var_108" "var_109" "var_110" "var_111"
[113] "var_112" "var_113" "var_114" "var_115" "var_116" "var_117" "var_118"
[120] "var_119" "var_120" "var_121" "var_122" "var_123" "var_124" "var_125"
[127] "var_126" "var_127" "var_128" "var_129" "var_130" "var_131" "var_132"
[134] "var_133" "var_134" "var_135" "var_136" "var_137" "var_138" "var_139"
[141] "var_140" "var_141" "var_142" "var_143" "var_144" "var_145" "var_146"
[148] "var_147" "var_148" "var_149" "var_150" "var_151" "var_152" "var_153"
[155] "var_154" "var_155" "var_156" "var_157" "var_158" "var_159" "var_160"
[162] "var_161" "var_162" "var_163" "var_164" "var_165" "var_166" "var_167"
[169] "var_168" "var_169" "var_170" "var_171" "var_172" "var_173" "var_174"
[176] "var_175" "var_176" "var_177" "var_178" "var_179" "var_180" "var_181"
[183] "var_182" "var_183" "var_184" "var_185" "var_186" "var_187" "var_188"
[190] "var_189" "var_190" "var_191" "var_192" "var_193" "var_194" "var_195"
[197] "var_196" "var_197" "var_198" "var_199"

$y
[1] "target"

[1] "/kaggle/working/StackedEnsemble_BestOfFamily_AutoML_20190325_112022"

# A tibble: 4 x 6
  model_id                            auc logloss mean_per_class_er…  rmse   mse
  <chr>                             <dbl>   <dbl>              <dbl> <dbl> <dbl>
1 StackedEnsemble_BestOfFamily_Aut… 0.820   0.519              0.274 0.416 0.173
2 StackedEnsemble_AllModels_AutoML… 0.820   0.519              0.274 0.416 0.173
3 XRT_1_AutoML_20190325_112022      0.804   0.589              0.282 0.447 0.200
4 DRF_1_AutoML_20190325_112022      0.797   0.591              0.294 0.449 0.201

Parsed with column specification:
cols(
  .default = col_double(),
  ID_code = col_character()
)
See spec(...) for full column specifications.

  |======================================================================| 100%

  |                                                                            
  |                                                                      |   0%

java.lang.IllegalArgumentException: Actual column must be integer class labels!

java.lang.IllegalArgumentException: Actual column must be integer class labels!
    at hex.GainsLift.init(GainsLift.java:51)
    at hex.GainsLift.exec(GainsLift.java:124)
    at hex.glm.GLMMetricBuilder.makeModelMetrics(GLMMetricBuilder.java:217)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1456)
    at hex.Model.score(Model.java:1381)
    at hex.ensemble.StackedEnsembleModel.predictScoreImpl(StackedEnsembleModel.java:150)
    at hex.Model.score(Model.java:1381)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:374)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1386)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: java.lang.IllegalArgumentException: Actual column must be integer class labels!
Execution halted

After I installed H2O version 3.22.1.6, the error no longer occurred.
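Besides downgrading, the commented-out exclude_algos line in the script hints at another way around the failure: score with a non-ensemble model when the ensemble cannot predict. The following is only a sketch of that fallback pattern in base R. The helper predict_with_fallback and the stand-in fake_predict are hypothetical names introduced for illustration; in a real session the scoring function would wrap the h2o.predict call from the script above.

```r
# Hypothetical helper: try each model id in order and return the first
# prediction that does not raise an error.
predict_with_fallback <- function(model_ids, predict_fn) {
  for (id in model_ids) {
    result <- tryCatch(predict_fn(id), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(model_id = id, predictions = result))
    }
  }
  stop("no model in the list produced predictions")
}

# Stand-in for the real scoring call (assumption): it fails for ensemble ids
# to mimic the error reported above and returns toy numbers otherwise.
fake_predict <- function(id) {
  if (startsWith(id, "StackedEnsemble")) {
    stop("Actual column must be integer class labels!")
  }
  c(0.1, 0.9)
}

# Model ids taken from the leaderboard printed in the log.
ids <- c("StackedEnsemble_BestOfFamily_AutoML_20190325_112022",
         "XRT_1_AutoML_20190325_112022")
out <- predict_with_fallback(ids, fake_predict)
out$model_id
```

Passing the scoring step in as a function keeps the sketch runnable without an H2O cluster. Note that it silently swallows every error, so a real version would probably want to log the message before moving on to the next model.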

sebhrusen commented 5 years ago

Hi @henry090, do you mind trying your scenario again with a more recent nightly build? http://h2o-release.s3.amazonaws.com/h2o/master/4617/index.html I have some reason to think that it may have been fixed by https://0xdata.atlassian.net/browse/PUBDEV-6208.

One quick question though: is the ../input/test.csv file different from the test frame you passed as the leaderboard frame?

turgut090 commented 5 years ago

> http://h2o-release.s3.amazonaws.com/h2o/master/4617/index.html

Actually, I tested that version, too. h2o-3.23.0.4617 does not work with StackedEnsemble. So, downgrading helped a lot.

> file different from the test frame

No, it has exactly the same structure. This is why I was confused. Kaggle updated h2o, and I have to install the previous 3.22 version every time.
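One way to back up a "same structure" claim is to compare column names and classes directly. This is a minimal sketch in base R. The helper compare_schema and the two toy frames are hypothetical stand-ins, not the actual Kaggle data.

```r
# Hypothetical helper: report columns whose presence or class differs
# between two data frames.
compare_schema <- function(a, b) {
  classes_a <- vapply(a, function(col) class(col)[1], character(1))
  classes_b <- vapply(b, function(col) class(col)[1], character(1))
  shared <- intersect(names(a), names(b))
  list(
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a)),
    class_mismatch = shared[classes_a[shared] != classes_b[shared]]
  )
}

# Toy frames (assumption): one with a response column, one without.
leaderboard_like <- data.frame(target = c(0L, 1L), var_0 = c(1.5, 2.5))
test_like <- data.frame(var_0 = c(3.5, 4.5))

result <- compare_schema(leaderboard_like, test_like)
str(result)
```

On the toy frames this flags target as present only in the first frame. A difference like that (a missing or differently typed response column) is the kind of thing worth checking when scoring fails while computing metrics.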

Here is the data: https://www.kaggle.com/c/santander-customer-transaction-prediction/data

Another user who faced the same issue: https://stackoverflow.com/questions/55194145/error-when-calling-test-file-in-h2o-predict-function/55337889#55337889

sebhrusen commented 5 years ago

@henry090, thanks for the dataset, I'll try to reproduce this.

What exactly do you mean by "h2o-3.23.0.4617 does not work with StackedEnsemble"? Do you get a different issue with that version, or is it the same error?

turgut090 commented 5 years ago

I mean there are h2o 3.22 versions (for example 3.22.1.6) and 3.23 versions. The 3.22 version is stable, but the 3.23 versions I tried (3.23.0.4617 and 3.23.0.4602) do not work with StackedEnsemble.
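The version reports in this thread can be captured in a small lookup, which may be handy at the top of a script that depends on StackedEnsemble predictions. This is only a sketch. The helper build_status is a hypothetical name, and the lists contain nothing beyond the three builds mentioned here.

```r
# Builds named in this thread, with the behaviour reported for each.
reported_failing <- c("3.23.0.4602", "3.23.0.4617")
reported_working <- c("3.22.1.6")

# Hypothetical helper: classify a version against the reports above.
build_status <- function(version) {
  v <- as.character(version)
  if (v %in% reported_failing) return("reported failing")
  if (v %in% reported_working) return("reported working")
  "not covered in this thread"
}

build_status("3.23.0.4602")
build_status(package_version("3.22.1.6"))
# With h2o installed, the local build could be checked with:
# build_status(utils::packageVersion("h2o"))
```

The lookup deliberately does not guess about builds outside the list, since the thread gives no information on them.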

sebhrusen commented 5 years ago

@henry090: I identified the issue; a fix should land in a nightly build soon. Please follow progress here: https://0xdata.atlassian.net/browse/PUBDEV-6376 Thanks!

sebhrusen commented 5 years ago

Closing this ticket. Refer to https://0xdata.atlassian.net/browse/PUBDEV-6376 for the Jira issue and to the PR at https://github.com/h2oai/h2o-3/pull/3382