h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

30 container cloud build (also during parse) on mr-0xd* cluster. Got ERR messages but h2odriver didn't abort the cloud build #14777

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

1) An ERR should always abort things. If we don't want to abort on this, is the cloud broken or not? How does the user know? Is the cloud the size he asked for, or some other size? It seems bad not to abort.

2) What is the root cause of the ERR? I speculate below.
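
On the "is the cloud the size he asked for?" question in 1): one rough way to check is to read the last "Cloud of size N formed" entry from a node log and compare it with the requested size. This is only a sketch that assumes the log format shown in this issue; `last_cloud_size` and `cloud_matches_request` are hypothetical helper names, not part of h2odriver.

```python
import re

def last_cloud_size(log_text):
    """Return the size from the last 'Cloud of size N formed' entry, or None."""
    sizes = re.findall(r"Cloud of size (\d+) formed", log_text)
    return int(sizes[-1]) if sizes else None

def cloud_matches_request(log_text, requested):
    """True only if the most recently reported cloud size equals the requested size."""
    return last_cloud_size(log_text) == requested

# Two entries taken from the log in this issue (member list of the second one cut short).
sample = (
    "07-30 11:33:34.833 172.16.2.181:54329 32745 main INFO: "
    "Cloud of size 1 formed [/172.16.2.181:54329] "
    "07-30 11:33:38.055 172.16.2.181:54329 32745 FJ-126-15 INFO: "
    "Cloud of size 30 formed [/172.16.2.181:54321, /172.16.2.181:54325"
)

print(last_cloud_size(sample))            # size reported by the last entry
print(cloud_matches_request(sample, 30))  # compare against the requested 30
```

This only tells you what the node last reported; it says nothing about whether the cloud is still healthy after the ERR messages.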

From Eric's first log:

He gets the same thing during cloud building as capone (the "can not send ack through" message).

He's running multiple containers on each of the mr-0xd* machines to get to 30.

So is the problem here related to having 30 JVMs? Or does having multiple containers on one machine slow things down enough that the error messages happen during cloud building?

It's odd that Eric got the same thing capone got here.

Maybe capone has multiple containers on one machine too? I glanced at their IPs and thought they were all different machines, but I'm unsure. We probably need to check.
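
A quick way to do that check is to group the cloud members by IP. The sketch below assumes the member list looks like the `/ip:port` entries in the "Cloud of size ... formed" line of Eric's log; `jvms_per_host` is a hypothetical helper name.

```python
import re
from collections import defaultdict

def jvms_per_host(member_list):
    """Map each host IP to the ports of the JVMs listed for it."""
    hosts = defaultdict(list)
    for ip, port in re.findall(r"/(\d+\.\d+\.\d+\.\d+):(\d+)", member_list):
        hosts[ip].append(int(port))
    return dict(hosts)

# First four members from Eric's "Cloud of size 30 formed" line.
sample = "[/172.16.2.181:54321, /172.16.2.181:54325, /172.16.2.181:54329, /172.16.2.182:54321"

per_host = jvms_per_host(sample)
# Hosts that run more than one JVM, i.e. more than one container on the same machine.
shared = {ip: ports for ip, ports in per_host.items() if len(ports) > 1}
print(shared)
```

Running the same thing over the member list in the capone log would show whether their nodes share machines too.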

Maybe it's just a matter of the machine being busy with other work during cloud building, which affects the latency of whatever cloud building is checking here?

(I wonder if 10-15 JVMs on one machine would get similar messages during cloud building. We used to test that in h2o2; I haven't so much in h2o3.)

Also: why doesn't h2o fail on this ERR? If it's an ERR, then something should abort the whole works. Otherwise it should be a WARN if we want things to stay up. (What's supposed to detect this ERR and abort things?)

07-30 11:33:34.443 172.16.2.181:54329 32745 main INFO: H2O cloud name: 'H2O_24968' on /172.16.2.181:54329, discovery address /237.59.191.31:60731
07-30 11:33:34.444 172.16.2.181:54329 32745 main INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
07-30 11:33:34.444 172.16.2.181:54329 32745 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54329 yarn@172.16.2.181'
07-30 11:33:34.444 172.16.2.181:54329 32745 main INFO: 2. Point your browser to http://localhost:55555
07-30 11:33:34.767 172.16.2.181:54329 32745 main INFO: Log dir: '/opt2/hdp/yarn/local/usercache/eric/appcache/application_1435999527262_1542/h2ologs'
07-30 11:33:34.767 172.16.2.181:54329 32745 main INFO: Cur dir: '/home2/hdp/yarn/local/usercache/eric/appcache/application_1435999527262_1542/container_e04_1435999527262_1542_01_000022'
07-30 11:33:34.777 172.16.2.181:54329 32745 main INFO: Using HDFS configuration from /etc/hadoop/conf
07-30 11:33:34.777 172.16.2.181:54329 32745 main INFO: HDFS subsystem successfully initialized
07-30 11:33:34.777 172.16.2.181:54329 32745 main INFO: S3 subsystem successfully initialized
07-30 11:33:34.811 172.16.2.181:54329 32745 main INFO: Flow dir: 'hdfs://mr-0xd6.0xdata.loc:8020/user/eric/h2oflows'
07-30 11:33:34.833 172.16.2.181:54329 32745 main INFO: Cloud of size 1 formed [/172.16.2.181:54329]
EmbeddedH2OConfig: notifyAboutCloudSize called (172.16.2.181, 54329, 1)
07-30 11:33:34.835 172.16.2.181:54329 32745 main INFO: Registered 0 extensions in: 587mS
07-30 11:33:35.209 172.16.2.181:54329 32745 main INFO: Registered: 104 REST APIs in: 374mS
07-30 11:33:35.743 172.16.2.181:54329 32745 main INFO: Registered: 171 schemas in: 534mS
POST 12: After main
POST 14: Waiting for exit
07-30 11:33:38.055 172.16.2.181:54329 32745 FJ-126-15 INFO: Cloud of size 30 formed [/172.16.2.181:54321, /172.16.2.181:54325, /172.16.2.181:54329, /172.16.2.182:54321, /172.16.2.182:54325, /172.16.2.182:54329, /172.16.2.183:54321, /172.16.2.183:54325, /172.16.2.183:54329, /172.16.2.184:54321, /172.16.2.184:54325, /172.16.2.184:54329, /172.16.2.185:54321, /172.16.2.185:54325, /172.16.2.185:54329, /172.16.2.186:54321, /172.16.2.186:54325, /172.16.2.186:54329, /172.16.2.187:54321, /172.16.2.187:54325, /172.16.2.187:54333, /172.16.2.188:54321, /172.16.2.188:54325, /172.16.2.188:54329, /172.16.2.189:54321, /172.16.2.189:54325, /172.16.2.189:54329, /172.16.2.190:54325, /172.16.2.190:54329, /172.16.2.190:54333]
EmbeddedH2OConfig: notifyAboutCloudSize called (172.16.2.181, 54329, 30)
07-30 12:05:23.140 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.186:46090
07-30 12:05:23.173 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.181:36424
07-30 12:05:23.175 172.16.2.181:54329 32745 FJ-123-15 INFO: Locking cloud to new members, because water.Lockable$PriorWriteLock
07-30 12:05:30.764 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.187:53157
07-30 12:05:30.838 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.187:53197
07-30 12:05:30.859 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.182:38602
07-30 12:05:30.861 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.190:41076
07-30 12:05:30.862 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.184:40940
07-30 12:05:30.863 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.186:46494
07-30 12:05:30.923 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.182:38627
07-30 12:05:30.937 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.189:43618
07-30 12:05:31.017 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.188:57035
07-30 12:05:31.134 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.183:41226
07-30 12:05:31.319 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.185:48260
07-30 12:05:31.322 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.190:41136
07-30 12:05:31.327 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.186:46518
07-30 12:05:45.108 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.184:41056
07-30 12:05:55.277 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.181:36906
07-30 12:05:58.829 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.182:38769
07-30 12:06:11.719 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.190:41255
07-30 12:06:23.278 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.189:43788
07-30 12:06:24.804 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.183:41370
07-30 12:06:26.351 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.188:57176
07-30 12:06:33.133 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.188:57217
07-30 12:06:35.162 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.183:41416
07-30 12:06:35.503 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.184:41183
07-30 12:06:36.338 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.185:48420
07-30 12:06:36.740 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.187:53396
07-30 12:06:37.919 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.185:48428
07-30 12:06:38.410 172.16.2.181:54329 32745 #P-Accept INFO: starting new UDP-TCP receiver thread connected to /172.16.2.189:43854
07-30 12:07:26.910 172.16.2.181:54329 32745 FJ-121-35 ERRR: Possibly broken network, can not send ack through, got 10
07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-121-35 ERRR: Possibly broken network, can not send ack through, got 20
07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-123-55 ERRR: Possibly broken network, can not send ack through, got 40
07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-121-7 ERRR: Possibly broken network, can not send ack through, got 10
07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-123-13 ERRR: Possibly broken network, can not send ack through, got 30
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 10
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-25 ERRR: Possibly broken network, can not send ack through, got 60
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-51 ERRR: Possibly broken network, can not send ack through, got 70
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-1 ERRR: Possibly broken network, can not send ack through, got 50
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-37 ERRR: Possibly broken network, can not send ack through, got 40
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-123-3 ERRR: Possibly broken network, can not send ack through, got 30
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-123-9 ERRR: Possibly broken network, can not send ack through, got 20
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-1 ERRR: Possibly broken network, can not send ack through, got 100
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 90
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-37 ERRR: Possibly broken network, can not send ack through, got 110
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-123-59 ERRR: Possibly broken network, can not send ack through, got 80
07-30 12:10:47.695 172.16.2.181:54329 32745 FJ-121-51 ERRR: Possibly broken network, can not send ack through, got 120
07-30 12:10:47.696 172.16.2.181:54329 32745 FJ-121-25 ERRR: Possibly broken network, can not send ack through, got 130
07-30 12:10:47.696 172.16.2.181:54329 32745 FJ-121-51 ERRR: Possibly broken network, can not send ack through, got 140
07-30 12:10:47.696 172.16.2.181:54329 32745 FJ-123-59 ERRR: Possibly broken network, can not send ack through, got 150
07-30 12:10:47.696 172.16.2.181:54329 32745 FJ-121-37 ERRR: Possibly broken network, can not send ack through, got 160
07-30 12:10:47.696 172.16.2.181:54329 32745 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 170
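
To see how these ERRR entries cluster in time, it may help to count them per node and per second. The following is a rough sketch that assumes the entry layout seen above (timestamp, node, pid, thread, level, message); `summarize_ack_errors` is a hypothetical helper name, and the regex matches on the whole text so it does not depend on where the line breaks fall.

```python
import re
from collections import Counter

ACK_ERR = re.compile(
    r"(\d\d-\d\d \d\d:\d\d:\d\d)\.\d{3} "  # timestamp, captured down to the second
    r"(\S+) \d+ \S+ "                      # node ip:port (captured), pid, thread name
    r"ERRR: Possibly broken network, can not send ack through, got (\d+)"
)

def summarize_ack_errors(log_text):
    """Count 'can not send ack through' ERRR entries per (node, second)."""
    bursts = Counter()
    for ts, node, _got in ACK_ERR.findall(log_text):
        bursts[(node, ts)] += 1
    return bursts

# Three entries copied from the log above.
sample = (
    "07-30 12:07:26.910 172.16.2.181:54329 32745 FJ-121-35 ERRR: "
    "Possibly broken network, can not send ack through, got 10 "
    "07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-121-35 ERRR: "
    "Possibly broken network, can not send ack through, got 20 "
    "07-30 12:07:31.593 172.16.2.181:54329 32745 FJ-123-55 ERRR: "
    "Possibly broken network, can not send ack through, got 40"
)

for (node, ts), count in sorted(summarize_ack_errors(sample).items()):
    print(node, ts, count)
```

Comparing the bursts across nodes (and against the capone log) might show whether the messages line up with busy periods on shared machines.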

exalate-issue-sync[bot] commented 1 year ago

Kevin Normoyle commented: Eric gets the same "broken network" message during the parse, but the Python client didn't get the ERR message.

Shouldn't this cause some kind of abort to the client? If it's an ERR, then the cloud is broken.

If it's not an ERR, it should say WARN.

It's very unclear what this message is trying to tell the user, and why, if it's an ERR, it isn't severe enough to shut down the cloud.

(Plus it's exactly what capone saw. How come Eric sees this right away, when supposedly the capone case was tried before the code was sent to capone?)

Maybe it's always been in the logs but no one noticed, because it doesn't propagate to the client or shut down the cloud?

It's a fatal error from my point of view. If you want tests to fail, do you have to throw a stack trace here?
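
Until h2o itself aborts on this, a test harness could treat any ERRR entry in a node log as a failure. This is only a sketch of that idea, assuming the log layout shown in this issue; `fail_on_errr` and `CloudLogError` are hypothetical names, not existing h2o test utilities.

```python
import re

# Any ERRR-level entry: timestamp, node ip:port, pid, thread name, then the level tag.
ANY_ERRR = re.compile(r"\d\d-\d\d \d\d:\d\d:\d\d\.\d{3} \S+ \d+ \S+ ERRR: ")

class CloudLogError(AssertionError):
    """Raised when a node log contains ERRR entries."""

def fail_on_errr(log_text):
    """Raise CloudLogError if the log text contains at least one ERRR entry."""
    hits = len(ANY_ERRR.findall(log_text))
    if hits:
        raise CloudLogError(f"{hits} ERRR entries found in node log")

# One INFO entry and one ERRR entry copied from the parse log in this issue.
clean = "07-30 12:15:10.779 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608"
bad = clean + (
    " 07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-121-39 ERRR: "
    "Possibly broken network, can not send ack through, got 20"
)

fail_on_errr(clean)  # passes silently
try:
    fail_on_errr(bad)
except CloudLogError as err:
    print("test would fail:", err)
```

That makes the test fail even when nothing propagates to the client, though it does not answer whether the message should be an ERR or a WARN in the first place.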

14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.779 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.783 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.788 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.793 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.797 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.802 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.806 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.811 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.816 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.821 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.825 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.830 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:10.835 172.16.2.186:54329 14006 #/3/Parse INFO: Parse chunk size 8388608
07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-121-39 ERRR: Possibly broken network, can not send ack through, got 20
07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-123-11 ERRR: Possibly broken network, can not send ack through, got 40
07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-123-9 ERRR: Possibly broken network, can not send ack through, got 30
07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-121-53 ERRR: Possibly broken network, can not send ack through, got 10
07-30 12:15:26.001 172.16.2.186:54329 14006 FJ-121-11 ERRR: Possibly broken network, can not send ack through, got 50
07-30 12:15:26.295 172.16.2.186:54329 14006 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 70
07-30 12:15:26.295 172.16.2.186:54329 14006 FJ-121-39 ERRR: Possibly broken network, can not send ack through, got 60
07-30 12:15:26.296 172.16.2.186:54329 14006 FJ-121-39 ERRR: Possibly broken network, can not send ack through, got 80
07-30 12:15:32.988 172.16.2.186:54329 14006 FJ-121-53 ERRR: Possibly broken network, can not send ack through, got 670
07-30 12:15:32.988 172.16.2.186:54329 14006 FJ-123-9 ERRR: Possibly broken network, can not send ack through, got 700
07-30 12:15:32.988 172.16.2.186:54329 14006 FJ-121-39 ERRR: Possibly broken network, can not send ack through, got 690
07-30 12:15:32.988 172.16.2.186:54329 14006 FJ-121-11 ERRR: Possibly broken network, can not send ack through, got 680
07-30 12:15:33.228 172.16.2.186:54329 14006 FJ-123-51 ERRR: Possibly broken network, can not send ack through, got 710
07-30 12:15:33.388 172.16.2.186:54329 14006 FJ-123-37 ERRR: Possibly broken network, can not send ack through, got 720
07-30 12:15:33.388 172.16.2.186:54329 14006 FJ-121-53 ERRR: Possibly broken network, can not send ack through, got 760
07-30 12:15:33.388 172.16.2.186:54329 14006 FJ-121-11 ERRR: Possibly broken network, can not send ack through, got 750
07-30 12:15:33.388 172.16.2.186:54329 14006 FJ-121-39 ERRR: Possibly broken network, can not send ack through, got 740
07-30 12:15:33.388 172.16.2.186:54329 14006 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 730
07-30 12:15:33.406 172.16.2.186:54329 14006 FJ-123-23 ERRR: Possibly broken network, can not send ack through, got 770
07-30 12:17:12.108 172.16.2.186:54329 14006 FJ-123-7 ERRR: Possibly broken network, can not send ack through, got 140
07-30 12:17:13.559 172.16.2.186:54329 14006 FJ-123-7 ERRR: Possibly broken network, can not send ack through, got 10
07-30 12:17:13.559 172.16.2.186:54329 14006 FJ-123-7 ERRR: Possibly broken network, can not send ack through, got 20
07-30 12:17:13.560 172.16.2.186:54329 14006 FJ-123-7 ERRR: Possibly broken network, can not send ack through, got 30

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-1818
Assignee: Cliff Click
Reporter: Kevin Normoyle
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: capone.log.yarn1
Attached By: Kevin Normoyle
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-1818/capone.log.yarn1