h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

GLM: The first model should not throw an NPE on completion if the user runs another model with the same destination name while the first one is still running #14012

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Parse a big dataset and run a big model, say one with lambda search. While the first job is running, change the alpha value in the GLM frame and fire another job with the same destination frame. The second job fails with:

Got exception 'class java.lang.IllegalArgumentException', with msg 'class hex.glm.GLMModel glm-ccedb30b-839b-43ce-8395-3fdec800c9f2 is already in use. Unable to use it now. Consider using a different destination name.'

java.lang.IllegalArgumentException: class hex.glm.GLMModel glm-ccedb30b-839b-43ce-8395-3fdec800c9f2 is already in use. Unable to use it now. Consider using a different destination name.
    at water.Lockable$PriorWriteLock.atomic(Lockable.java:109)
    at water.Lockable$PriorWriteLock.atomic(Lockable.java:98)
    at water.TAtomic.atomic(TAtomic.java:17)
    at water.Atomic.compute2(Atomic.java:55)
    at water.Atomic.fork(Atomic.java:39)
    at water.Atomic.invoke(Atomic.java:31)
    at water.Lockable.write_lock(Lockable.java:59)
    at water.Lockable.delete_and_lock(Lockable.java:66)
    at hex.glm.GLM.init(GLM.java:315)
    at hex.glm.GLM$GLMDriver.compute2(GLM.java:625)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:682)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

This rejection is good. But when the original (first) job finishes (its status shows as done), fetching its model fails:

Error calling GET /3/Models/glm-ccedb30b-839b-43ce-8395-3fdec800c9f2 with opts null

Object 'glm-ccedb30b-839b-43ce-8395-3fdec800c9f2' not found for argument: key

Object 'glm-ccedb30b-839b-43ce-8395-3fdec800c9f2' not found for argument: key (water.exceptions.H2OKeyNotFoundArgumentException)
    water.api.ModelsHandler.getFromDKV(ModelsHandler.java:105)
    water.api.ModelsHandler.fetch(ModelsHandler.java:124)
    sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    water.api.Handler.handle(Handler.java:57)
    water.api.RequestServer.handle(RequestServer.java:675)
    water.api.RequestServer.serve(RequestServer.java:611)
    water.NanoHTTPD$HTTPSession.run(NanoHTTPD.java:436)
    java.lang.Thread.run(Thread.java:744)

Flow steps are in the attached file.
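The behaviour this report asks for can be sketched with a toy key store. This is only an illustration under stated assumptions: `ToyStore`, `KeyInUseError`, `put_and_unlock`, and the job and key names are hypothetical and are not H2O's `Lockable` or DKV implementation. The point is the ordering: the lock check happens before anything stored under the key is touched, so a rejected second job cannot remove what the first job will later publish.

```python
class KeyInUseError(Exception):
    """Raised when a key is write-locked by a different job."""


class ToyStore:
    """Hypothetical stand-in for a key/value store with per-key write locks."""

    def __init__(self):
        self._values = {}   # key -> stored model
        self._owners = {}   # key -> id of the job holding the write lock

    def delete_and_lock(self, key, job_id):
        # Check the lock BEFORE deleting, so a rejected job leaves the key alone.
        owner = self._owners.get(key)
        if owner is not None and owner != job_id:
            raise KeyInUseError(
                f"{key} is already in use. Consider using a different destination name."
            )
        self._values.pop(key, None)
        self._owners[key] = job_id

    def put_and_unlock(self, key, job_id, value):
        # Only the job that holds the lock may publish a result and release it.
        if self._owners.get(key) != job_id:
            raise KeyInUseError(f"{job_id} does not hold the write lock on {key}")
        self._values[key] = value
        del self._owners[key]

    def get(self, key):
        return self._values.get(key)


store = ToyStore()
store.delete_and_lock("glm-dest", "job-1")        # first job starts

try:
    store.delete_and_lock("glm-dest", "job-2")    # second job, same destination
    second_job_error = None
except KeyInUseError as err:
    second_job_error = str(err)

store.put_and_unlock("glm-dest", "job-1", {"alpha": 0.5})  # first job finishes
fetched = store.get("glm-dest")                   # the later GET should find the model
```

In this sketch the second job is rejected with an "already in use" message, and the first job's model is still retrievable once it finishes, which is the outcome the report expects from the real system.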

exalate-issue-sync[bot] commented 1 year ago

Raymond Peck commented: Nidhi, can you see if this happens with all the algos, or only with GLM? Thanks!

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: For DL I got an NPE on an earlier build. I cannot reproduce it on the latest master though, so it may be a timing issue.

But I do get the error message below, which is not useful to the user. It appears when the second model is fired while the first model is still being built:

buildModel 'deeplearning', {"model_id":"deeplearning-276d1733-b417-4a66-ad44-42b3e8ab1095","training_frame":"Key_Frame__covktr.hex","drop_na20_cols":false,"response_column":"V55","activation":"TanhWithDropout","hidden":[200,20],"epochs":10,"variable_importances":false,"balance_classes":false,"checkpoint":"","use_all_factor_levels":true,"train_samples_per_iteration":-2,"adaptive_rate":true,"input_dropout_ratio":0,"hidden_dropout_ratios":[],"l1":0,"l2":0,"loss":"Automatic","score_interval":5,"score_training_samples":10000,"score_duty_cycle":0.1,"autoencoder":false,"override_with_best_model":true,"target_ratio_comm_to_comp":0.02,"seed":842683419038542800,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity","initial_weight_distribution":"UniformAdaptive","classification_stop":0,"diagnostics":true,"fast_mode":true,"ignore_const_cols":true,"force_load_balance":true,"single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":"MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,"average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,"reproducible":false,"export_weights_and_biases":false}

JOB FAILURE.

Got exception 'class java.lang.AssertionError', with msg 'Can't unlock: Not locked!'

java.lang.AssertionError: Can't unlock: Not locked!
    at water.Lockable$Unlock.atomic(Lockable.java:181)
    at water.Lockable$Unlock.atomic(Lockable.java:176)
    at water.TAtomic.atomic(TAtomic.java:17)
    at water.Atomic.compute2(Atomic.java:55)
    at water.Atomic.fork(Atomic.java:39)
    at water.Atomic.invoke(Atomic.java:31)
    at water.Lockable.unlock(Lockable.java:171)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:477)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:331)
    at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:134)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:682)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

GBM is fine
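One possible reading of the "Can't unlock: Not locked!" message, which this thread does not confirm, is that a job's completion or cleanup path tries to release a lock it does not hold. A minimal sketch of a cleanup pattern that avoids this is shown here. All names (`LockTable`, `run_job`, `NotLockedError`, `KeyInUseError`, the job and key ids) are hypothetical and do not come from the H2O code base.

```python
class KeyInUseError(Exception):
    """Raised when another job already holds the lock on a key."""


class NotLockedError(Exception):
    """Raised when a job tries to release a lock it does not hold."""


class LockTable:
    """Hypothetical per-key lock table; not H2O's Lockable."""

    def __init__(self):
        self._owners = {}  # key -> id of the job holding the lock

    def lock(self, key, job_id):
        if key in self._owners:
            raise KeyInUseError(f"{key} is already in use by {self._owners[key]}")
        self._owners[key] = job_id

    def unlock(self, key, job_id):
        if self._owners.get(key) != job_id:
            raise NotLockedError("Can't unlock: Not locked!")
        del self._owners[key]

    def owner(self, key):
        return self._owners.get(key)


def run_job(table, key, job_id, work):
    """Run work() under the lock, releasing it only if this job acquired it."""
    holds_lock = False
    try:
        table.lock(key, job_id)
        holds_lock = True
        return work()
    finally:
        # Without this guard, a job rejected at lock() would hit NotLockedError
        # here, hiding the useful "already in use" message from the user.
        if holds_lock:
            table.unlock(key, job_id)


table = LockTable()
table.lock("dl-dest", "job-1")  # first job is still running

try:
    run_job(table, "dl-dest", "job-2", lambda: "model-2")
    reported = None
except Exception as err:
    reported = type(err).__name__

owner_after = table.owner("dl-dest")
```

With the guard in place, the second job surfaces the "already in use" error rather than an unlock failure, and the first job still owns the key.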

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: On Jenkins 1198, I get the following for GLM and DL.

DL:

Got exception 'class water.DException$DistributedException', with msg 'from /10.10.0.77:54321; by class water.Lockable$Unlock; class java.lang.NullPointerException: null'

water.DException$DistributedException: from /10.10.0.77:54321; by class water.Lockable$Unlock; class java.lang.NullPointerException: null
    at water.Lockable.set_unlocked(Lockable.java:212)
    at water.Lockable.access$400(Lockable.java:25)
    at water.Lockable$Unlock.atomic(Lockable.java:182)
    at water.Lockable$Unlock.atomic(Lockable.java:176)
    at water.TAtomic.atomic(TAtomic.java:17)
    at water.Atomic.compute2(Atomic.java:55)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:682)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Type: Model
Key: deeplearning-093dab23-c398-4043-b4ef-863f91482db3
Description: DeepLearning
Status: FAILED
Run Time: 00:00:21.354
Progress: 100%, FAILED

GLM:

Error calling GET /3/Models/glm-7311db02-21e4-4b32-8bb4-e896a5ba6824 with opts null

Object 'glm-7311db02-21e4-4b32-8bb4-e896a5ba6824' not found for argument: key

Object 'glm-7311db02-21e4-4b32-8bb4-e896a5ba6824' not found for argument: key (water.exceptions.H2OKeyNotFoundArgumentException)
    water.api.ModelsHandler.getFromDKV(ModelsHandler.java:105)
    water.api.ModelsHandler.fetch(ModelsHandler.java:124)
    sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    water.api.Handler.handle(Handler.java:57)
    water.api.RequestServer.handle(RequestServer.java:676)
    water.api.RequestServer.serve(RequestServer.java:612)
    water.NanoHTTPD$HTTPSession.run(NanoHTTPD.java:438)
    java.lang.Thread.run(Thread.java:745)

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1031
Assignee: Tomas Nykodym
Reporter: Nidhi Mehta
State: Resolved
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: flow.txt
Attached By: Nidhi Mehta
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-1031/flow.txt