h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Interactive REST calls run at normal F/J priorities, and will wait for other normal F/J work #15046

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

10 nodes, mr-0xd*

parseFiles
  paths: ["/home/0xdiag/datasets/billions/four_billion_rows.csv"]
  destination_frame: "four_billion_rows.hex"
  parse_type: "CSV"
  separator: 44
  number_columns: 2
  single_quotes: false
  column_names: null
  column_types: ["Numeric","Enum"]
  delete_on_done: true
  check_header: -1
  chunk_size: 4194304

buildModel 'deeplearning', {"model_id":"deeplearning-82ac6efa-06a8-400b-8a7d-87defafc5b73","training_frame":"four_billion_rows.hex","nfolds":0,"response_column":"C2","ignored_columns":[],"ignore_const_cols":true,"activation":"Rectifier","hidden":[200,200],"epochs":10,"variable_importances":false,"balance_classes":false,"max_confusion_matrix_size":20,"max_hit_ratio_k":10,"checkpoint":"","use_all_factor_levels":true,"train_samples_per_iteration":"-1","adaptive_rate":true,"input_dropout_ratio":0,"l1":0,"l2":0,"loss":"Automatic","distribution":"AUTO","score_interval":5,"score_training_samples":10000,"score_duty_cycle":0.1,"replicate_training_data":true,"autoencoder":false,"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.02,"seed":476458924607192960,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity","initial_weight_distribution":"UniformAdaptive","classification_stop":0,"diagnostics":true,"fast_mode":true,"force_load_balance":true,"single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":"MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,"average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,"reproducible":false,"export_weights_and_biases":false}

While DL is training, calling 'getFrames' from Flow takes at least 10 minutes to respond (it does respond eventually).

"qtp285402953-13" prio=9 tid=13 java.lang.Thread.State: TIMED_WAITING

at java.lang.Object.wait(Native Method)
at water.RPC.block(RPC.java:273)
at jsr166y.ForkJoinPool.managedBlock(ForkJoinPool.java:2803)
at water.RPC.get(RPC.java:262)
at water.RPC.get(RPC.java:48)
at water.Futures.blockForPending(Futures.java:71)
at water.fvec.Frame.bulkRollups(Frame.java:400)
at water.fvec.Frame.byteSize(Frame.java:445)
at water.api.FrameSynopsisV3.<init>(FrameSynopsisV3.java:31)
at water.api.FramesBase.fillFromImplWithSynopsis(FramesBase.java:88)
at water.api.FramesHandler.list(FramesHandler.java:131)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at water.api.Handler.handle(Handler.java:58)
at water.api.RequestServer.handle(RequestServer.java:639)
at water.api.RequestServer.serve(RequestServer.java:580)

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: REST calls for inspection, such as "Frame", "Frames", and the various summaries, require rollups and run at the normal F/J priorities. If the F/J queues are slammed with other work, e.g. a big DL job, the interactive commands run on a "best effort" basis and so get stuck behind the DL work.
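To make the queueing effect concrete, here is a minimal sketch of the idea, not H2O's actual scheduler. It models a single worker draining one queue. The class `PrioritySketch`, the task names, and the numeric priority values are all hypothetical, and a real F/J pool has many workers, so this only illustrates ordering, not timing. When the interactive request has the same priority as the bulk work it runs last; when its priority is raised it runs first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model only: none of these names come from H2O.
class PrioritySketch {
    // A unit of queued work; a higher priority value means "run sooner".
    static final class Task {
        final String name;
        final int priority;
        final long seq; // submission order, keeps FIFO order within a priority

        Task(String name, int priority, long seq) {
            this.name = name;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Queue five bulk tasks, then one interactive request, and return the
    // order in which a single worker would run them.
    static List<String> runOrder(int interactivePriority) {
        Comparator<Task> order = Comparator
                .comparingInt((Task t) -> -t.priority) // higher priority first
                .thenComparingLong(t -> t.seq);        // then oldest first
        PriorityQueue<Task> queue = new PriorityQueue<>(order);

        long seq = 0;
        for (int i = 0; i < 5; i++) {
            queue.add(new Task("dl-" + i, 0, seq++)); // bulk work, hypothetical priority 0
        }
        queue.add(new Task("getFrames", interactivePriority, seq++));

        List<String> ran = new ArrayList<>();
        while (!queue.isEmpty()) {
            ran.add(queue.poll().name);
        }
        return ran;
    }

    public static void main(String[] args) {
        System.out.println("same priority:   " + runOrder(0));
        System.out.println("raised priority: " + runOrder(1));
    }
}
```

With equal priorities the request's wait grows with the amount of bulk work queued ahead of it, which matches the multi-minute 'getFrames' delay reported above.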

exalate-issue-sync[bot] commented 1 year ago

Cliff Click commented: Hopefully found and fixed all the places where interactive calls were waiting for cores. Mostly it was ChunkSummary & Rollups.

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2109
Assignee: Cliff Click
Reporter: Arno Candel
State: Resolved
Fix Version: N/A
Attachments: N/A
Development PRs: N/A