h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Low Memory condition not reported from Server. #7936

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The logs show that I ran out of memory, but the H2O server didn't propagate the warning to the client, so my training job died mysteriously:

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Obviously, it's my fault there wasn't enough memory (I have a DataFrame of 93 million rows), but it would have been nice to get some warning during training.

Also, it would have been good to get a swapping warning, as that would have told me that my training is running slowly because of lack of RAM.
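
Until the server surfaces these warnings itself, a client-side workaround is to scan the directory passed as `log_dir` for the swapping messages. The following is a minimal sketch, not part of H2O: the helper name `find_swap_warnings` is hypothetical, and the marker string is taken from the log tail quoted later in this issue, so it may differ in other H2O versions.

```python
from pathlib import Path

# Marker text copied from the log tail in this issue (assumption: other
# H2O versions may word the warning differently).
SWAP_MARKER = "WARN: Swapping!"


def find_swap_warnings(log_dir):
    """Return (file name, line) pairs for every swapping warning under log_dir."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                if SWAP_MARKER in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

Calling `find_swap_warnings("/tmp/clem-h2o/")` periodically from a separate thread or process while `model.train(...)` runs would at least expose the condition before the connection drops.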

Not exactly reproducible, which isn’t surprising since I’m using multiple cores.

System info

{noformat}
$ lshw -short
H/W path    Device  Class      Description
==========================================
                    system     Computer
/0                  bus        Motherboard
/0/0                memory     373GiB System memory
/0/1                processor  Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
/0/100              bridge     440FX - 82441FX PMC [Natoma]
/0/100/1            bridge     82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.3          generic    82371AB/EB/MB PIIX4 ACPI
/0/100/3            display    Amazon.com, Inc.
/0/100/4            storage    Amazon.com, Inc.
/0/100/5            network    Elastic Network Adapter (ENA)
/0/100/6            network    Elastic Network Adapter (ENA)
/0/100/1e           storage    NVMe SSD Controller
/0/100/1f           storage    NVMe SSD Controller
/1          eth0    network    Ethernet interface

$ uname -a
Linux [REDACTED] #1 SMP Wed Apr 29 09:56:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 3180.952
BogoMIPS: 4999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
{noformat}

Python:

{code:java}
import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init(
    strict_version_check=False,
    nthreads=1,
    log_dir="/tmp/clem-h2o/",
    log_level='TRACE',
)  # Yes, this is necessary! If you skip it, the H2OContext.getOrCreate() hangs

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_252"; OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09); OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
  Starting server from /usr/local/lib/python3.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpzx3n769z
  JVM stdout: /tmp/tmpzx3n769z/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpzx3n769z/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 01 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.0.7
H2O_cluster_version_age: 7 days and 40 minutes
H2O_cluster_name: H2O_from_python_unknownUser_lt01t8
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 17.78 Gb
H2O_cluster_total_cores: 10
H2O_cluster_allowed_cores: 10
H2O_cluster_status: accepting new members, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.5 final

logloss train, test

param = {
    "ntrees": 6,  # 10 # was 76
    "min_rows": 6,  # was 11
    "max_depth": 8,  # was 11
    "learn_rate": 0.25,
    "sample_rate": 0.9,
    "col_sample_rate": 0.95,
    "col_sample_rate_per_tree": 0.95,
    "booster": "gbtree",
    "seed": 42,
    "score_tree_interval": 100,
    # These two params emulate LightGBM: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html#lightgbm-emulation-mode-options
    "tree_method": "hist",
    "grow_policy": "lossguide",
}

model = H2OXGBoostEstimator(**param)
model.train(x=x, y=y, training_frame=hdftrain)
{code}

Tail end of log file:

{code:java}
07-28 18:41:59.131 127.0.0.1:54321 840 #e Thread WARN: Swapping! GC CALLBACK, (K/V:14.94 GB + POJO:613.5 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.11 GB OOM!
07-28 18:41:54.820 127.0.0.1:54321 840 FJ-2-25 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.2 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.51 GB OOM!
07-28 18:43:13.482 127.0.0.1:54321 840 #e Thread WARN: Swapping! GC CALLBACK, (K/V:14.94 GB + POJO:613.5 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.67 GB OOM!
07-28 18:43:19.684 127.0.0.1:54321 840 FJ-2-25 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.5 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.67 GB OOM!
07-28 18:43:28.999 127.0.0.1:54321 840 #e Thread WARN: Swapping! GC CALLBACK, (K/V:14.94 GB + POJO:613.5 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=10.86 GB OOM!
07-28 18:44:06.019 127.0.0.1:54321 840 FJ-2-123 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.2 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.01 GB OOM!
07-28 18:45:33.343 127.0.0.1:54321 840 FJ-2-123 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.6 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.72 GB OOM!
07-28 18:45:58.000 127.0.0.1:54321 840 FJ-2-25 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.5 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.72 GB OOM!
07-28 18:46:21.296 127.0.0.1:54321 840 FJ-2-123 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.6 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.85 GB OOM!
07-28 18:46:24.411 127.0.0.1:54321 840 FJ-2-25 WARN: Swapping! OOM, (K/V:14.94 GB + POJO:613.6 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.89 GB OOM!
07-28 18:46:25.024 127.0.0.1:54321 840 #e Thread WARN: Swapping! GC CALLBACK, (K/V:14.94 GB + POJO:613.6 MB + FREE:2.23 GB == MEM_MAX:17.78 GB), desiredKV=11.85 GB OOM!
{code}
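
Each warning line above carries a memory breakdown (`K/V + POJO + FREE == MEM_MAX`) that a client could use to decide how close the node is to running out of heap. Below is a rough sketch of pulling those figures out of one line. The function names are hypothetical, the field layout is inferred only from the lines quoted here, and treating the units as binary (1 GB = 1024 MB) is an assumption.

```python
import re

# Conversion factors to GB; binary units are an assumption about how the
# figures in the log are rounded.
_UNITS = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

# Matches e.g. "K/V:14.94 GB", "POJO:613.5 MB", "desiredKV=11.11 GB".
_FIELD = re.compile(r"(K/V|POJO|FREE|MEM_MAX|desiredKV)[:=]([\d.]+) (KB|MB|GB|TB)")


def parse_memory_fields(line):
    """Extract the memory figures from one warning line, converted to GB."""
    return {
        name: float(value) * _UNITS[unit]
        for name, value, unit in _FIELD.findall(line)
    }


def free_fraction(fields):
    """Fraction of the maximum heap that the line reports as free."""
    return fields["FREE"] / fields["MEM_MAX"]
```

For the lines in this log, `FREE` is 2.23 GB of a 17.78 GB maximum, so `free_fraction` comes out at roughly 0.125 on every entry, while `desiredKV` stays well below the 14.94 GB actually held in the K/V store.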

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7703
Assignee: New H2O Bugs
Reporter: Clem Wang
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A