h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0
6.9k stars 2k forks source link

Intermittent: DeepLearningCheckpointReporting.run #15340

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

DeepLearningCheckpointReporting.run passes locally but fails on Jenkins with the following error:

1) run(hex.deeplearning.DeepLearningCheckpointReporting) java.lang.RuntimeException: java.lang.RuntimeException: java.lang.AssertionError: Duration must increase: 6.352 -> 4.917 at hex.deeplearning.DeepLearningCheckpointReporting.run(DeepLearningCheckpointReporting.java:148) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Repro with debugging output: https://0xdata.atlassian.net/secure/attachment/22515/console.txt

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: A different failure also attached: https://0xdata.atlassian.net/secure/attachment/22516/console.txt

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Pushed an improved scheme for reporting times/duration. The test can still fail if the run is unusually slowed down, e.g., with a heavily overloaded system.

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: {code} 12-13 17:04:36.196 172.17.2.164:44008 24755 FJ-0-41 INFO: Time taken for scoring and diagnostics: 0.716 sec 12-13 17:04:36.197 172.17.2.164:44008 24755 FJ-0-41 INFO: Training time: 2.174 sec (scoring: 1.054 sec). Processed 760 samples (2.000 epochs). 12-13 17:04:36.197 172.17.2.164:44008 24755 FJ-0-41 INFO: Iterations: 2. Epochs: 2.00000. Speed: 678 samples/sec. Estimated time left: 4.281 sec 12-13 17:04:36.821 172.17.2.164:44008 24755 FJ-0-41 INFO: Scoring the model. 12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: Status of Neuron Layers (predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 42,402 weights/biases, 505.5 KB, 1,140 training samples, mini-batch size 1): 12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: Layer Units Type Dropout L1 L2 Mean Rate Rate RMS Momentum Mean Weight Weight RMS Mean Bias Bias RMS 12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 8 Input 0.00 %
12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: 2 200 Rectifier 0.00 % 0.000000 0.000000 0.002839 0.001625 0.000000 -0.001402 0.096987 0.466778 0.022235 12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: 3 200 Rectifier 0.00 % 0.000000 0.000000 0.017704 0.068269 0.000000 -0.002617 0.069410 0.995127 0.007523 12-13 17:04:36.845 172.17.2.164:44008 24755 FJ-0-41 INFO: 4 2 Softmax 0.000000 0.000000 0.001589 0.001627 0.000000 -0.007182 0.392470 -0.000000 0.002736 12-13 17:04:37.607 172.17.2.164:44008 24755 FJ-0-41 INFO: Confusion Matrix (vertical: actual; across: predicted): 12-13 17:04:37.607 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 1 Error Rate 12-13 17:04:37.607 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 141 86 0.3789 = 86 / 227 12-13 17:04:37.607 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 22 131 0.1438 = 22 / 153 12-13 17:04:37.607 172.17.2.164:44008 24755 FJ-0-41 INFO: Totals 163 217 0.2842 = 108 / 380 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Model Metrics Type: Binomial 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Description: Metrics reported on full training frame 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: model id: first_model 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: frame id: _bfedf02cb29eb07ef92040e4018a4ac3.first_model.tmptrain.chks1 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: MSE: 0.22551608 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: R^2: 0.0623788 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: AUC: 0.8140278 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: logloss: 0.8109188 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: CM: Confusion Matrix (vertical: actual; across: predicted): 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 1 Error Rate 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 141 86 0.3789 = 86 / 227 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 22 131 0.1438 = 22 / 153 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Totals 163 217 0.2842 = 108 / 380 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Gains/Lift Table (Avg response rate: 40.26 %): 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Group Cumulative Data Fraction Lower Threshold Lift Cumulative Lift Response Rate Cumulative Response Rate Capture Rate Cumulative Capture Rate Gain Cumulative Gain 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 0.01052632 0.999996 2.483660 2.483660 1.000000 1.000000 0.026144 0.026144 148.366013 148.366013 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2 0.02105263 0.999952 2.483660 2.483660 1.000000 1.000000 0.026144 0.052288 148.366013 148.366013 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 3 0.03157895 0.999925 1.862745 2.276688 0.750000 0.916667 0.019608 0.071895 86.274510 127.668845 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 4 0.04210526 0.999838 2.483660 2.328431 1.000000 0.937500 0.026144 0.098039 148.366013 132.843137 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 5 0.05000000 0.999815 2.483660 2.352941 1.000000 0.947368 0.019608 0.117647 148.366013 135.294118 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 6 0.10000000 0.998915 2.091503 2.222222 0.842105 0.894737 0.104575 0.222222 109.150327 122.222222 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 7 0.15000000 0.992854 1.699346 2.047930 0.684211 0.824561 0.084967 0.307190 69.934641 104.793028 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 8 0.20000000 0.982534 1.830065 1.993464 0.736842 0.802632 0.091503 0.398693 83.006536 99.346405 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 9 0.30000000 0.950916 1.633987 1.873638 0.657895 0.754386 0.163399 0.562092 63.398693 87.363834 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 10 0.40000000 0.820674 1.045752 1.666667 0.421053 0.671053 0.104575 0.666667 4.575163 66.666667 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 11 0.50000000 0.656706 0.915033 1.516340 0.368421 0.610526 0.091503 0.758170 -8.496732 51.633987 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 12 0.60000000 0.438503 1.045752 1.437908 0.421053 0.578947 0.104575 0.862745 4.575163 43.790850 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 13 0.70000000 0.230939 0.588235 1.316527 0.236842 0.530075 0.058824 0.921569 -41.176471 31.652661 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 14 0.80000000 0.107822 0.522876 1.217320 0.210526 0.490132 0.052288 0.973856 -47.712418 21.732026 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 15 0.90000000 0.019449 0.196078 1.103849 0.078947 0.444444 0.019608 0.993464 -80.392157 10.384895 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 16 1.00000000 0.000000 0.065359 1.000000 0.026316 0.402632 0.006536 1.000000 -93.464052 0.000000 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Status of Neuron Layers (predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 42,402 weights/biases, 505.5 KB, 1,140 training samples, mini-batch size 1): 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Layer Units Type Dropout L1 L2 Mean Rate Rate RMS Momentum Mean Weight Weight RMS Mean Bias Bias RMS 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 8 Input 0.00 %
12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2 200 Rectifier 0.00 % 0.000000 0.000000 0.002839 0.001625 0.000000 -0.001402 0.096987 0.466778 0.022235 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 3 200 Rectifier 0.00 % 0.000000 0.000000 0.017704 0.068269 0.000000 -0.002617 0.069410 0.995127 0.007523 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 4 2 Softmax 0.000000 0.000000 0.001589 0.001627 0.000000 -0.007182 0.392470 -0.000000 0.002736 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Scoring History: 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: Timestamp Duration Training Speed Epochs Iterations Samples Training MSE Training R^2 Training LogLoss Training AUC Training Lift Training Classification Error 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2015-12-13 17:04:34 0.000 sec 0.00000 0 0.000000 NaN NaN NaN NaN NaN NaN 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2015-12-13 17:04:34 0.833 sec 767 rows/sec 1.00000 1 380.000000 0.21623 0.10101 0.80595 0.81172 2.48366 0.26053 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2015-12-13 17:04:35 2.143 sec 697 rows/sec 2.00000 2 760.000000 0.21565 0.10339 0.86859 0.81858 2.48366 0.24474 12-13 17:04:37.674 172.17.2.164:44008 24755 FJ-0-41 INFO: 2015-12-13 17:04:36 3.615 sec 654 rows/sec 3.00000 3 1140.000000 0.22552 0.06238 0.81092 0.81403 2.48366 0.28421 12-13 17:04:37.676 172.17.2.164:44008 24755 FJ-0-41 INFO: Time taken for scoring and diagnostics: 0.818 sec 12-13 17:04:37.678 172.17.2.164:44008 24755 FJ-0-41 INFO: Training time: 3.654 sec (scoring: 1.872 sec). Processed 1,140 samples (3.000 epochs). 12-13 17:04:37.679 172.17.2.164:44008 24755 FJ-0-41 INFO: Iterations: 3. Epochs: 3.00000. Speed: 639 samples/sec. Estimated time left: 2.797 sec 12-13 17:04:39.564 172.17.2.164:44008 24755 FJ-1-41 WARN: 4 node(s) (out of 5) are not contributing to model updates. Consider setting replicate_training_data to true or using a larger training dataset (or fewer H2O nodes). 12-13 17:04:39.566 172.17.2.164:44008 24755 FJ-0-41 INFO: Scoring the model. 12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: Status of Neuron Layers (predicting CAPSULE, 2-class classification, bernoulli distribution, CrossEntropy loss, 42,402 weights/biases, 505.5 KB, 1,520 training samples, mini-batch size 1): 12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: Layer Units Type Dropout L1 L2 Mean Rate Rate RMS Momentum Mean Weight Weight RMS Mean Bias Bias RMS 12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 8 Input 0.00 %
12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: 2 200 Rectifier 0.00 % 0.000000 0.000000 0.002844 0.001607 0.000000 -0.001722 0.096428 0.458234 0.025949 12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: 3 200 Rectifier 0.00 % 0.000000 0.000000 0.019408 0.073634 0.000000 -0.003093 0.069429 0.994042 0.008598 12-13 17:04:39.570 172.17.2.164:44008 24755 FJ-0-41 INFO: 4 2 Softmax 0.000000 0.000000 0.001614 0.001631 0.000000 -0.007182 0.390556 -0.000000 0.003087 12-13 17:05:00.027 172.17.2.164:44008 24755 FJ-0-41 INFO: Confusion Matrix (vertical: actual; across: predicted): 12-13 17:05:00.027 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 1 Error Rate 12-13 17:05:00.027 172.17.2.164:44008 24755 FJ-0-41 INFO: 0 162 65 0.2863 = 65 / 227 12-13 17:05:00.027 172.17.2.164:44008 24755 FJ-0-41 INFO: 1 32 121 0.2092 = 32 / 153 12-13 17:05:00.027 172.17.2.164:44008 24755 FJ-0-41 INFO: Totals 194 186 0.2553 = 97 / 380 12-13 17:05:00.072 172.17.2.164:44008 24755 FJ-0-41 INFO: Time taken for scoring and diagnostics: 20.502 sec {code}

Failed because scoring suddenly took way too long: 12-13 17:04:37.676 172.17.2.164:44008 24755 FJ-0-41 INFO: Time taken for scoring and diagnostics: 0.818 sec 12-13 17:05:00.072 172.17.2.164:44008 24755 FJ-0-41 INFO: Time taken for scoring and diagnostics: 20.502 sec

{code} 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: Scoring History: 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: Timestamp Duration Training Speed Epochs Iterations Samples Training MSE Training R^2 Training LogLoss Training AUC Training Lift Training Classification Error 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:04:34 0.000 sec 0.00000 0 0.000000 NaN NaN NaN NaN NaN NaN 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:04:34 0.833 sec 767 rows/sec 1.00000 1 380.000000 0.21623 0.10101 0.80595 0.81172 2.48366 0.26053 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:04:35 2.143 sec 697 rows/sec 2.00000 2 760.000000 0.21565 0.10339 0.86859 0.81858 2.48366 0.24474 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:04:36 3.615 sec 654 rows/sec 3.00000 3 1140.000000 0.22552 0.06238 0.81092 0.81403 2.48366 0.28421 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:04:39 26.045 sec 414 rows/sec 4.00000 4 1520.000000 0.19696 0.18111 0.68145 0.82387 2.48366 0.25526 12-13 17:05:04.996 172.17.2.164:44008 24755 FJ-0-19 INFO: 2015-12-13 17:05:04 7.308 sec -120 rows/sec 5.00000 5 1900.000000 0.18540 0.22916 0.60210 0.83326 2.48366 0.23947 12-13 17:05:04.998 172.17.2.164:44008 24755 FJ-0-19 INFO: Time taken for scoring and diagnostics: 0.741 sec java.lang.AssertionError at hex.deeplearning.DeepLearningModel.progressUpdate(DeepLearningModel.java:694) at hex.deeplearning.DeepLearningModel.doScoring(DeepLearningModel.java:682) at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:421) at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:284) at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:191) at water.H2O$H2OCountedCompleter.compute(H2O.java:1057) at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974) at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477) at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) {code}

exalate-issue-sync[bot] commented 1 year ago

Arno Candel commented: Now it's still misreporting speed when training is slow: 365 vs 726 for the same snapshot. http://mr-0xa1:8080/view/arno/job/arno_gradle_build/2993/consoleText

{code} 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: Scoring History: 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: Timestamp Duration Training Speed Epochs Iterations Samples Training MSE Training R^2 Training LogLoss Training AUC Training Lift Training Classification Error 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:16 0.000 sec 0.00000 0 0.000000 NaN NaN NaN NaN NaN NaN 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:16 0.750 sec 518 rows/sec 1.00000 1 380.000000 0.21623 0.10101 0.80595 0.81172 2.48366 0.26053 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:17 1.408 sec 686 rows/sec 2.00000 2 760.000000 0.21565 0.10339 0.86859 0.81858 2.48366 0.24474 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:17 1.883 sec 837 rows/sec 3.00000 3 1140.000000 0.22552 0.06238 0.81092 0.81403 2.48366 0.28421 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:18 2.354 sec 868 rows/sec 4.00000 4 1520.000000 0.19696 0.18111 0.68145 0.82387 2.48366 0.25526 12-14 11:32:25.284 172.17.2.164:44008 17428 FJ-0-15 INFO: 2015-12-14 11:32:22 6.117 sec 365 rows/sec 5.00000 5 1900.000000 0.18540 0.22916 0.60210 0.83326 2.48366 0.23947 12-14 11:32:25.286 172.17.2.164:44008 17428 FJ-0-15 INFO: Time taken for scoring and diagnostics: 2.612 sec 12-14 11:32:25.298 172.17.2.164:44008 17428 FJ-0-15 INFO: Training time: 6.138 sec (scoring: 3.286 sec). Processed 1,900 samples (5.000 epochs). 12-14 11:32:25.299 172.17.2.164:44008 17428 FJ-0-15 INFO: Iterations: 5. Epochs: 5.00000. Speed: 726 samples/sec. {code}

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2435 Assignee: Arno Candel Reporter: Ludi Rehak State: Resolved Fix Version: N/A Attachments: Available (Count: 5) Development PRs: N/A

Attachments From Jira

Attachment Name: console.txt Attached By: Arno Candel File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2435/console.txt

Attachment Name: console.txt Attached By: Arno Candel File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2435/console.txt

Attachment Name: console3.txt Attached By: Arno Candel File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2435/console3.txt

Attachment Name: Screen Shot 2015-12-08 at 5.56.14 PM.png Attached By: Arno Candel File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2435/Screen Shot 2015-12-08 at 5.56.14 PM.png

Attachment Name: Screen Shot 2015-12-08 at 5.57.04 PM.png Attached By: Arno Candel File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-2435/Screen Shot 2015-12-08 at 5.57.04 PM.png