h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Null Pointer/ Assertion on export weights #12101

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Problem: Setting export_weights_and_biases=True in the H2O Python client causes Deep Learning model training to fail with a NullPointerException, or with an AssertionError when assertions are enabled.

Temporary workaround: Disabling the export of weights and biases avoids the error.

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_151"; OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12); OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpSMnFK
JVM stdout: /tmp/tmpSMnFK_/h2o_ubuntu_started_from_python.out
JVM stderr: /tmp/tmpSMnFK/h2o_ubuntu_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 02 secs
H2O cluster version: 3.16.0.3
H2O cluster version age: 8 days
H2O cluster name: H2O_from_python_ubuntu_btslvo
H2O cluster total nodes: 1
H2O cluster free memory: 26.67 Gb
H2O cluster total cores: 32
H2O cluster allowed cores: 32
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.12 final


Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
deeplearning Model Build progress: |████████████████████████████▊ (failed) | 49%
Traceback (most recent call last):
  File "./project/python-scripts/scripts/runner.py", line 13, in <module>
    dl_model.train_model(validate=False)
  File "/home/ubuntu/project/python-scripts/models/h2o_deep_learning.py", line 64, in train_model
    self.model.train(x=h2o_training.names[:-1], y=h2o_training.names[-1], training_frame=h2o_training)
  File "/usr/local/lib/python2.7/dist-packages/h2o/estimators/estimator_base.py", line 208, in train
    model.poll(verbose_model_scoring_history=verbose)
  File "/usr/local/lib/python2.7/dist-packages/h2o/job.py", line 77, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
EnvironmentError: Job with key $03017f00000132d4ffffffff$_8b10a08c73ca59324ab104f391084dfc failed with an exception: java.lang.NullPointerException
stacktrace:
java.lang.NullPointerException
    at water.Key.compareTo(Key.java:482)
    at java.util.Arrays.binarySearch0(Arrays.java:2439)
    at java.util.Arrays.binarySearch(Arrays.java:2379)
    at water.Scope.exit(Scope.java:45)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:332)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

H2O session _sid_90b6 closed.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Hello [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99], we're working on reproducing your issue. Could you please run H2O with assertions enabled? From Python, you can do so by passing the enable_assertions=True argument to the h2o.init(...) method.

This should produce a much more detailed output.

Thank you, Pavel
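For readers following along, the request above can be sketched as follows. This assumes the h2o Python package and a Java runtime are installed; the helper name start_h2o_with_assertions is hypothetical and only wraps the h2o.init(enable_assertions=True) call mentioned in the comment.

```python
def start_h2o_with_assertions():
    """Start (or connect to) a local H2O cluster, asking for JVM assertions."""
    # Imported lazily so this sketch can be loaded even where h2o is not installed.
    import h2o

    # enable_assertions=True is the argument named in the comment above.
    # It applies to a server that h2o.init launches itself.
    h2o.init(enable_assertions=True)
    return h2o
```

Call start_h2o_with_assertions() before building the model. If an H2O instance is already running at the target address, h2o.init connects to it instead of starting a new one, so the flag should only be expected to take effect on a freshly started server.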

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: Just a note. From the original script provided by [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99] ( [^train_model.py] ), the following code has been extracted:

{code:python}
import h2o

h2o.init()  # start or connect to a local H2O cluster

training_data = h2o.import_file("/home/pavel/training.csv")

if 'label' in training_data.names:
    training_data['label'] = training_data['label'].asfactor()
else:
    raise AttributeError("label {0} not found".format('label'))

estimator = h2o.estimators.deeplearning.H2ODeepLearningEstimator(hidden=[50, 50, 50, 50, 50],
                                                                 activation='rectifier',
                                                                 adaptive_rate=True,
                                                                 balance_classes=True,
                                                                 epochs=50,
                                                                 shuffle_training_data=True,
                                                                 score_each_iteration=True,
                                                                 stopping_metric='auc',
                                                                 stopping_rounds=5,
                                                                 stopping_tolerance=.01,
                                                                 use_all_factor_levels=False,
                                                                 variable_importances=False,
                                                                 export_weights_and_biases=True,
                                                                 seed=200)
estimator.train(x=training_data.names[:-1], y=training_data.names[-1], training_frame=training_data)

{code}

This is a simplified version of the provided script. The part that handles validation data is omitted because, according to the logs provided by the reporter, it was never triggered.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: In the Scope class, on line 47, there is a check for the key being null. The problem occurs on the line before, where the binary search is done. The null check should be performed before the binary search ( int found = Arrays.binarySearch(arrkeep, key); ). However, this is not the root cause. Merely skipping a null key when exiting a scope does not prevent the memory leak.

We're waiting for the client to reproduce this behavior once again with assertions enabled as we're unable to reproduce the issue with the training data provided.
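The ordering problem described above (null check after the binary search instead of before it) can be illustrated with a small Python model. This is only a sketch, not H2O's Java code: the function name keys_to_remove and its arguments are hypothetical, and Python's None stands in for a null Key. In Python 3, comparing None with a string during the search raises a TypeError, which plays the role of the NullPointerException thrown from Key.compareTo.

```python
from bisect import bisect_left


def keys_to_remove(tracked_keys, keep_keys):
    """Return the tracked keys that are not in keep_keys, skipping null keys.

    The null guard runs BEFORE the binary search, so a None key is never
    compared against the sorted keep list.
    """
    keep_sorted = sorted(k for k in keep_keys if k is not None)
    to_remove = []
    for key in tracked_keys:
        if key is None:
            continue  # guard first; searching with None would raise TypeError
        idx = bisect_left(keep_sorted, key)
        found = idx < len(keep_sorted) and keep_sorted[idx] == key
        if not found:
            to_remove.append(key)
    return to_remove
```

As the comment above notes, a guard like this only avoids the exception. It does not explain why a keyless object ended up being tracked in the first place, which is the actual root cause.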

exalate-issue-sync[bot] commented 1 year ago

James Schneider commented: Console output with assertions enabled is below. I've also attached the Java standard output from H2O; the standard error file contained no data.

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_151"; OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12); OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpLPDDE6
JVM stdout: /tmp/tmpLPDDE6/h2o_ubuntu_started_from_python.out
JVM stderr: /tmp/tmpLPDDE6/h2o_ubuntu_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 03 secs
H2O cluster version: 3.16.0.3
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_ubuntu_rralln
H2O cluster total nodes: 1
H2O cluster free memory: 13.98 Gb
H2O cluster total cores: 32
H2O cluster allowed cores: 32
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.12 final


Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
deeplearning Model Build progress: |██████████▉ (failed) | 18%
Traceback (most recent call last):
  File "./project/python-scripts/scripts/runner.py", line 46, in <module>
    dl_model.train_model(validate=False)
  File "/home/ubuntu/project/python-scripts/models/h2o_deep_learning.py", line 62, in train_model
    self.model.train(x=h2o_training.names[:-1], y=h2o_training.names[-1], training_frame=h2o_training)
  File "/usr/local/lib/python2.7/dist-packages/h2o/estimators/estimator_base.py", line 208, in train
    model.poll(verbose_model_scoring_history=verbose)
  File "/usr/local/lib/python2.7/dist-packages/h2o/job.py", line 77, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
EnvironmentError: Job with key $03017f00000132d4ffffffff$_865d9770189f82c14e7aa73a2bfe4472 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at water.Key.compareTo(Key.java:481)
    at java.util.Arrays.binarySearch0(Arrays.java:2439)
    at java.util.Arrays.binarySearch(Arrays.java:2379)
    at water.Scope.exit(Scope.java:45)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:332)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

[^h2o_ubuntu_started_from_python.out]

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: We're actively working on this issue [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99]. Thank you for your help with such detailed output. It helps us a lot to reproduce your environment.

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: The issue only occurs with larger datasets that are split into more partitions. For example, it can be reproduced with our airlines_all.05p.csv from BigData (Laptop version).

exalate-issue-sync[bot] commented 1 year ago

Pavel Pscheidl commented: In the sampleFrame method, the sampled Frame may be left without a key if the original frame does not have one:

{code:java}
public static Frame sampleFrame(Frame fr, final long rows, final long seed) {
  if (fr == null) return null;
  final float fraction = rows > 0 ? (float)rows / fr.numRows() : 1.f;
  if (fraction >= 1.f) return fr;
  Key newKey = fr._key != null
      ? Key.make(fr._key.toString() + (fr._key.toString().contains("temporary") ? ".sample." : ".temporary.sample.") + PrettyPrint.formatPct(fraction).replace(" ",""))
      : null;
{code}
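The key-naming expression in the Java excerpt above can be modelled in a few lines of Python to make the null case explicit. This is a sketch under stated assumptions, not H2O code: sample_key_name is a hypothetical name, keys are modelled as plain strings, and pct_label stands in for whatever PrettyPrint.formatPct(fraction) returns, whose exact format is not shown in the excerpt.

```python
def sample_key_name(frame_key, pct_label):
    """Model of the newKey expression: derive a key name for a sampled frame.

    frame_key -- name of the original frame's key, or None if it has no key
    pct_label -- stand-in for the formatted sampling percentage (assumption)
    """
    if frame_key is None:
        # Mirrors the ': null' branch: a keyless input yields a keyless sample.
        return None
    # Avoid repeating "temporary" if the source key is already a temporary one.
    infix = ".sample." if "temporary" in frame_key else ".temporary.sample."
    return frame_key + infix + pct_label.replace(" ", "")
```

The None branch is the case the comment points at: when the original frame has no key, the sampled frame gets none either.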

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5229
Assignee: Pavel Pscheidl
Reporter: James Schneider
State: Resolved
Fix Version: 3.18.0.1
Attachments: Available (Count: 4)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/2011

Attachments From Jira

Attachment Name: h2o_ubuntu_started_from_python.out
Attached By: James Schneider
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/h2o_ubuntu_started_from_python.out

Attachment Name: testing.csv
Attached By: James Schneider
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/testing.csv

Attachment Name: train_model.py
Attached By: James Schneider
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/train_model.py

Attachment Name: training.csv
Attached By: James Schneider
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/training.csv