Pavel Pscheidl commented: Hello [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99], we're working on reproducing your issue. Could you please run H2O with assertions enabled? From Python, you can do this by passing the enable_assertions=True argument to the h2o.init(...) method.
This should produce much more detailed output.
Thank you, Pavel
Pavel Pscheidl commented: Just a note. From the original script provided by [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99] ( [^train_model.py] ), the following code has been extracted:
{code:python}
import h2o

h2o.init()  # start or connect to a local H2O cluster

training_data = h2o.import_file("/home/pavel/training.csv")

# The response column must be categorical for classification.
if 'label' in training_data.names:
    training_data['label'] = training_data['label'].asfactor()
else:
    raise AttributeError("label {0} not found".format('label'))

estimator = h2o.estimators.deeplearning.H2ODeepLearningEstimator(
    hidden=[50, 50, 50, 50, 50],
    activation='rectifier',
    adaptive_rate=True,
    balance_classes=True,
    epochs=50,
    shuffle_training_data=True,
    score_each_iteration=True,
    stopping_metric='auc',
    stopping_rounds=5,
    stopping_tolerance=.01,
    use_all_factor_levels=False,
    variable_importances=False,
    export_weights_and_biases=True,  # the option the reporter identified as triggering the failure
    seed=200)

# Use the last column as the response and all other columns as predictors.
estimator.train(x=training_data.names[:-1], y=training_data.names[-1], training_frame=training_data)
{code}
This is a simplification of the script provided (according to the logs supplied by the reporter, the part that uses validation data was not triggered).
Pavel Pscheidl commented: In the Scope class, line 47 checks whether the key is null. The problem occurs on the line before, where the binary search is performed ( int found = Arrays.binarySearch(arrkeep, key); ). The null check should be done before the binary search. However, this is not the root cause: skipping a null key while exiting a scope does not prevent memory from leaking.
We're waiting for the client to reproduce this behavior once again with assertions enabled, as we're unable to reproduce the issue with the training data provided.
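To illustrate the ordering problem described in the comment above, here is a minimal sketch in plain Java. It is not H2O's Scope code: the class name, the method names, and the use of String in place of water.Key are all assumptions made for illustration. It only shows that Arrays.binarySearch calls compareTo on the array elements with the searched key as the argument, so a null key fails inside the search unless the null check comes first. As the comment notes, such a guard would avoid the exception but would not address the root cause.

```java
import java.util.Arrays;

public class ScopeExitSketch {
    // Hypothetical stand-in for the lookup done when a scope exits.
    // Guarded version: the null check runs BEFORE the binary search.
    public static boolean isKept(String[] sortedKeep, String key) {
        if (key == null) return false; // nothing to look up for a keyless object
        return Arrays.binarySearch(sortedKeep, key) >= 0;
    }

    // Unguarded version: mirrors the reported ordering, search first.
    public static boolean isKeptUnguarded(String[] sortedKeep, String key) {
        return Arrays.binarySearch(sortedKeep, key) >= 0;
    }

    public static void main(String[] args) {
        // The array must be sorted for binarySearch to be meaningful.
        String[] keep = {"frame_a", "frame_b", "model_c"};

        System.out.println("guarded, present key: " + isKept(keep, "frame_b"));
        System.out.println("guarded, null key:    " + isKept(keep, null));

        try {
            isKeptUnguarded(keep, null);
        } catch (NullPointerException e) {
            // compareTo(null) is invoked inside the search and throws.
            System.out.println("unguarded, null key:  NullPointerException");
        }
    }
}
```

The unguarded call fails in the same place as the reported stack trace, inside the comparison made by the binary search, which is why a null check placed after the search never gets a chance to run.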
James Schneider commented: Console output with assertions enabled is below. I've also attached the Java standard output from H2O; the standard error had no data.
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_151"; OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12); OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpLPDDE6
JVM stdout: /tmp/tmpLPDDE6/h2o_ubuntu_started_from_python.out
JVM stderr: /tmp/tmpLPDDE6/h2o_ubuntu_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 03 secs
H2O cluster version: 3.16.0.3
H2O cluster version age: 16 days
H2O cluster name: H2O_from_python_ubuntu_rralln
H2O cluster total nodes: 1
H2O cluster free memory: 13.98 Gb
H2O cluster total cores: 32
H2O cluster allowed cores: 32
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.12 final
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
deeplearning Model Build progress: |██████████▉ (failed) | 18%
Traceback (most recent call last):
File "./project/python-scripts/scripts/runner.py", line 46, in
Pavel Pscheidl commented: We're actively working on this issue [~accountid:557058:c26fc9e0-bc6b-4b02-a1c7-7c9d53fb6d99]. Thank you for your help with such detailed output. It helps us a lot to reproduce your environment.
Pavel Pscheidl commented: The issue only occurs with large data sets that have more partitions. For example, it can be reproduced on our airlines_all.05p.csv from BigData (Laptop version).
Pavel Pscheidl commented: In the sampleFrame method, the sampled Frame may be left without a key if the original frame does not have one:
{code:java}
public static Frame sampleFrame(Frame fr, final long rows, final long seed) {
  if (fr == null) return null;
  final float fraction = rows > 0 ? (float)rows / fr.numRows() : 1.f;
  if (fraction >= 1.f) return fr;
  Key newKey = fr._key != null
      ? Key.make(fr._key.toString() + (fr._key.toString().contains("temporary") ? ".sample." : ".temporary.sample.") + PrettyPrint.formatPct(fraction).replace(" ",""))
      : null;
{code}
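The key-naming branch in the excerpt above can be tried in isolation with the following sketch. It uses plain strings instead of water.Key, and its formatPct helper is only a stand-in for H2O's PrettyPrint.formatPct, whose exact output format is an assumption here. The class and method names are hypothetical. The point it demonstrates is the one made in the comment: when the input has no key, the derived key is null as well.

```java
import java.util.Locale;

public class SampleKeySketch {
    // Stand-in for PrettyPrint.formatPct; the real format may differ.
    public static String formatPct(float fraction) {
        return String.format(Locale.ROOT, "%.2f%%", fraction * 100);
    }

    // Mirrors the ternary from the excerpt, with String in place of Key.
    public static String sampleKeyName(String originalKey, float fraction) {
        if (originalKey == null) return null; // keyless input yields a keyless sample
        String suffix = originalKey.contains("temporary")
                ? ".sample."
                : ".temporary.sample.";
        return originalKey + suffix + formatPct(fraction).replace(" ", "");
    }

    public static void main(String[] args) {
        // A frame with a key gets a derived, temporary-looking sample key.
        System.out.println(sampleKeyName("train.hex", 0.5f));
        // A frame without a key produces no key for the sample either.
        System.out.println(sampleKeyName(null, 0.5f));
    }
}
```

A sample frame created this way from a keyless frame is the kind of object that would later present a null key to any key-based bookkeeping.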
JIRA Issue Migration Info
Jira Issue: PUBDEV-5229
Assignee: Pavel Pscheidl
Reporter: James Schneider
State: Resolved
Fix Version: 3.18.0.1
Attachments: Available (Count: 4)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/2011
Attachments From Jira
Attachment Name: h2o_ubuntu_started_from_python.out Attached By: James Schneider File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/h2o_ubuntu_started_from_python.out
Attachment Name: testing.csv Attached By: James Schneider File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/testing.csv
Attachment Name: train_model.py Attached By: James Schneider File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/train_model.py
Attachment Name: training.csv Attached By: James Schneider File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5229/training.csv
Problem: Setting export_weights_and_biases=True in the H2O Python client causes a NullPointerException or an assertion error during deep learning model training.
Temp Fix: Disabling the export of weights and biases avoids the error.
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_151"; OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12); OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpSMnFK
JVM stdout: /tmp/tmpSMnFK_/h2o_ubuntu_started_from_python.out
JVM stderr: /tmp/tmpSMnFK/h2o_ubuntu_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 02 secs
H2O cluster version: 3.16.0.3
H2O cluster version age: 8 days
H2O cluster name: H2O_from_python_ubuntu_btslvo
H2O cluster total nodes: 1
H2O cluster free memory: 26.67 Gb
H2O cluster total cores: 32
H2O cluster allowed cores: 32
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.12 final
Parse progress: |█████████████████████████████████████████████████████████████████████████████| 100%
deeplearning Model Build progress: |████████████████████████████▊ (failed) | 49%
Traceback (most recent call last):
File "./project/python-scripts/scripts/runner.py", line 13, in
dl_model.train_model(validate=False)
File "/home/ubuntu/project/python-scripts/models/h2o_deep_learning.py", line 64, in train_model
self.model.train(x=h2o_training.names[:-1], y=h2o_training.names[-1], training_frame=h2o_training)
File "/usr/local/lib/python2.7/dist-packages/h2o/estimators/estimator_base.py", line 208, in train
model.poll(verbose_model_scoring_history=verbose)
File "/usr/local/lib/python2.7/dist-packages/h2o/job.py", line 77, in poll
"\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
EnvironmentError: Job with key $03017f00000132d4ffffffff$_8b10a08c73ca59324ab104f391084dfc failed with an exception: java.lang.NullPointerException
stacktrace:
java.lang.NullPointerException
at water.Key.compareTo(Key.java:482)
at java.util.Arrays.binarySearch0(Arrays.java:2439)
at java.util.Arrays.binarySearch(Arrays.java:2379)
at water.Scope.exit(Scope.java:45)
at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:332)
at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
H2O session _sid_90b6 closed.