h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Python - Rulefit training ends with error "Trying to unlock null! (key = rfit1)" #15965

Closed: BenUze closed this issue 7 months ago

BenUze commented 9 months ago

Versions:
  H2O version: 3.44.0.2
  OS: Ubuntu 20.04.6 LTS
  Python version: 3.11
  Java version: openjdk version "11.0.21" 2023-10-17; OpenJDK Runtime Environment (build 11.0.21+9-post-Ubuntu-0ubuntu120.04); OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)

Actual behavior: In a Jupyter notebook (Python 3.11), the RuleFit model build ends with "OSError('Job with key $03017f00000132d4ffffffff$_ad3f38a6ff226bfeeae60fe6934c4730 failed with an exception: java.lang.AssertionError: Trying to unlock null! (key = rfit1)". The models are built as part of a hyperparameter optimization with Optuna. This worked before, and I can't think of any change to the OS or to my Python environment that would cause such an issue.

Expected behavior: I expect the model training to complete normally, as it did before.

Steps to reproduce:

  1. Parse H2OFrame from pandas dataframe
  2. Train rulefit model on h2oframe
  3. See error
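The three steps above could look roughly like the following. This is only a minimal sketch and not the reporter's actual code: the toy columns `age`, `grade`, and `label`, the helper names, and the hyperparameter values are assumptions made for illustration. It assumes the `h2o` and `pandas` packages and a Java runtime are available; both packages are imported lazily so the data helper can be used on its own.

```python
def make_toy_rows(n=200):
    """Build a small made-up dataset as a list of dicts (hypothetical columns)."""
    grades = ["low", "medium", "high"]
    rows = []
    for i in range(n):
        rows.append({
            "age": 20 + i,
            "grade": grades[i % 3],
            "label": "yes" if i % 2 == 0 else "no",
        })
    return rows


def train_rulefit_on_toy_data():
    """Steps 1-3: pandas DataFrame -> H2OFrame -> RuleFit training."""
    import h2o
    import pandas as pd
    from h2o.estimators import H2ORuleFitEstimator

    h2o.init()

    # Step 1: parse an H2OFrame from a pandas DataFrame.
    df = pd.DataFrame(make_toy_rows())
    frame = h2o.H2OFrame(df)
    frame["label"] = frame["label"].asfactor()  # treat the target as categorical

    # Step 2: train a RuleFit model on the H2OFrame.
    # The values below are illustrative, not the reporter's settings.
    rfit = H2ORuleFitEstimator(
        algorithm="drf",
        min_rule_length=1,
        max_rule_length=3,
        max_num_rules=10,
        model_type="rules",
        seed=1,
    )
    rfit.train(x=["age", "grade"], y="label", training_frame=frame)

    # Step 3: on an affected setup, the OSError is raised from train() above.
    return rfit
```

Whether this toy data triggers the error is not known; it only shows the shape of the workflow.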

Error message:

Traceback (most recent call last):
  File "/home//mambaforge/envs/mlenv_new/lib/python3.11/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/tmp/ipykernel_56818/2565691037.py", line 59, in objective
    rfit.train(training_frame = train,
  File "/home//mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/estimators/estimator_base.py", line 107, in train
    self._train(parms, verbose=verbose)
  File "/home//mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/estimators/estimator_base.py", line 199, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/home//mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/job.py", line 88, in poll
    raise EnvironmentError("Job with key {} failed with an exception: {} stacktrace: "
OSError: Job with key $03017f00000132d4ffffffff$_ad3f38a6ff226bfeeae60fe6934c4730 failed with an exception: java.lang.AssertionError: Trying to unlock null! (key = rfit1)
stacktrace:
java.lang.AssertionError: Trying to unlock null! (key = rfit1)
    at water.Lockable$Unlock.atomic(Lockable.java:225)
    at water.Lockable$Unlock.atomic(Lockable.java:216)
    at water.TAtomic.atomic(TAtomic.java:18)
    at water.Atomic.compute2(Atomic.java:56)
    at water.Atomic.fork(Atomic.java:39)
    at water.Atomic.invoke(Atomic.java:31)
    at water.Lockable.unlock(Lockable.java:210)
    at water.Lockable.unlock(Lockable.java:205)
    at hex.rulefit.RuleFit$RuleFitDriver.computeImpl(RuleFit.java:272)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:253)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1689)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Thank you for any help

wendycwong commented 9 months ago

@BenUze: thank you for bringing this up. I ran into an error too. Will resolve.

wendycwong commented 9 months ago

@maurever

Just ran the pyunit_cancer_rulefit.py and ran into the following error:

[image: screenshot of the error]

maurever commented 9 months ago

@wendycwong, your error does not correspond with this bug report; see https://github.com/h2oai/h2o-3/pull/15974.

pyunit_cancer_rulefit.py does not reproduce the error that @BenUze mentioned.

maurever commented 9 months ago

@BenUze, could you please share working code on a sample dataset, if possible?

It looks like the RuleFit model was deleted before model training finished. However, I am not able to reproduce the error.

BenUze commented 7 months ago

Hi,

Sorry for the late answer; I've been trying to solve the issue on my own. As @maurever said, it appears that the model is deleted before training finishes. While troubleshooting, I noticed that the created model disappears from H2O Flow during the process. I tried to reproduce the error with the RuleFit example from the H2O-3 documentation, which uses the Titanic dataset, but it worked as intended. As you will see in the uploaded files, I am using Optuna for hyperparameter optimization, and it didn't cause any issue with the Titanic dataset.

Thank you again for your help

code_dataset.zip
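The attached archive is not reproduced here. As a rough idea of the setup described in this comment, an Optuna objective around RuleFit might be structured as below. This is a sketch under assumptions: the search ranges, the function names, and the use of log loss on a validation frame as the objective value are all hypothetical, and only the parameter names come from the trial log later in this thread.

```python
def suggest_rulefit_params(trial):
    """Sample RuleFit hyperparameters from an Optuna trial (assumed ranges)."""
    max_rule_length = trial.suggest_int("max_rule_length", 1, 5)
    return {
        "algorithm": trial.suggest_categorical("algorithm", ["drf", "gbm"]),
        "max_num_rules": trial.suggest_int("max_num_rules", 5, 50),
        "max_rule_length": max_rule_length,
        "model_type": trial.suggest_categorical(
            "model_type", ["rules", "rules_and_linear"]
        ),
        # Sample min_rule_length so it never exceeds max_rule_length.
        "min_rule_length": trial.suggest_int("min_rule_length", 1, max_rule_length),
    }


def make_objective(train, valid, x, y):
    """Return an Optuna objective that trains RuleFit on existing H2OFrames."""
    def objective(trial):
        from h2o.estimators import H2ORuleFitEstimator

        rfit = H2ORuleFitEstimator(seed=1, **suggest_rulefit_params(trial))
        rfit.train(x=x, y=y, training_frame=train)
        # Assumes a classification target, for which log loss is defined.
        return rfit.model_performance(valid).logloss()

    return objective


def run_study(train, valid, x, y, n_trials=20):
    """Run the search; failed H2O jobs are recorded instead of aborting."""
    import optuna

    study = optuna.create_study(direction="minimize")
    # catch=(OSError,) marks a trial as failed when the H2O job raises OSError
    # and lets the remaining trials continue.
    study.optimize(
        make_objective(train, valid, x, y), n_trials=n_trials, catch=(OSError,)
    )
    return study
```

Passing `catch=(OSError,)` does not fix the underlying failure; it only keeps the study running so the failing parameter combinations can be inspected afterwards.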

BenUze commented 7 months ago

Hi,

While trying to get RuleFit to work again, I came across a new error message. The code is the same, but this time I started H2O from a Unix shell with 32 GB dedicated to the JVM.

Error message:

[W 2024-01-31 16:46:10,264] Trial 0 failed with parameters: {'algorithm': 'drf', 'max_num_rules': 7, 'max_rule_length': 1, 'model_type': 'rules', 'min_rule_length': 1} because of the following error:

Traceback (most recent call last):
  File "/home/radars/mambaforge/envs/mlenv_new/lib/python3.11/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/tmp/ipykernel_46934/2371827781.py", line 58, in objective
    rfit.train(training_frame = train,
  File "/home/radars/mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/estimators/estimator_base.py", line 107, in train
    self._train(parms, verbose=verbose)
  File "/home/radars/mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/estimators/estimator_base.py", line 199, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/home/radars/mambaforge/envs/mlenv_new/lib/python3.11/site-packages/h2o/job.py", line 88, in poll
    raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
OSError: Job with key $03010a0a1d039c05ffffffff$_81a9d2850b608371db141fd9fd67433c failed with an exception: java.lang.NullPointerException
stacktrace:
java.lang.NullPointerException
    at water.Lockable$Unlock.atomic(Lockable.java:231)
    at water.Lockable$Unlock.atomic(Lockable.java:216)
    at water.TAtomic.atomic(TAtomic.java:18)
    at water.Atomic.compute2(Atomic.java:56)
    at water.Atomic.fork(Atomic.java:39)
    at water.Atomic.invoke(Atomic.java:31)
    at water.Lockable.unlock(Lockable.java:210)
    at water.Lockable.unlock(Lockable.java:205)
    at hex.rulefit.RuleFit$RuleFitDriver.computeImpl(RuleFit.java:272)
    at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:253)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1689)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

BenUze commented 7 months ago

Hi,

I have found the root cause of my issue. I was using the 'enum' data type for ordinal-encoded string variables. After casting these values to integers, everything works properly again.
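One way such a cast could be done is sketched below. The reporter's exact code is not shown in this thread, so everything here is an assumption: the helper names, the idea of encoding on the pandas side before upload, and the use of the `column_types` argument of `h2o.H2OFrame` to force the encoded columns to numeric instead of letting them be parsed as enum.

```python
def ordinal_encode(values, ordered_levels):
    """Map each string to its integer rank within ordered_levels."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


def to_h2o_with_int_ordinals(df, ordinal_columns):
    """Upload a pandas DataFrame with the given columns as integers.

    ordinal_columns maps a column name to its levels in ascending order,
    for example {"grade": ["low", "medium", "high"]} (hypothetical column).
    """
    import h2o

    df = df.copy()
    for col, levels in ordinal_columns.items():
        df[col] = ordinal_encode(df[col], levels)  # plain Python ints

    # Force the encoded columns to numeric so H2O does not treat them as enum.
    return h2o.H2OFrame(df, column_types={col: "numeric" for col in ordinal_columns})
```

For example, `ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"])` returns `[0, 2, 1]`.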

However, it remains unexplained why this problem appeared only after I had already used the model successfully on the same dataset with identical preprocessing.

I'm closing the issue.