h2oai / h2o-3


IOException during ACK, Connection timed out #9467

Closed by exalate-issue-sync[bot] 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I see this on the 16-node EC2 sri cluster, which has been up for a few weeks.

This is the node:

external IP: ec2-52-22-87-250.compute-1.amazonaws.com
internal IP: 10.10.0.140

STDERR

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
IO error java.lang.ArrayIndexOutOfBoundsException: 2
    at water.H2ONode.freeTCPSocket(H2ONode.java:423)
    at water.AutoBuffer.close(AutoBuffer.java:395)
    at water.RPC.remote_exec(RPC.java:533)
    at water.TCPReceiverThread$TCPReaderThread.run(TCPReceiverThread.java:205)

STDOUT

[ skip first few normal lines... ]

12-27 18:57:27.329 10.10.0.140:54321 20660 #27877-14 WARN: Resource not found: //MyAdmin/scripts/setup.php
12-27 18:57:27.330 10.10.0.140:54321 20660 #27877-14 WARN: Resource //MyAdmin/scripts/setup.php not found
12-27 18:57:27.330 10.10.0.140:54321 20660 #27877-14 WARN: {}
12-27 18:57:27.330 10.10.0.140:54321 20660 #27877-14 WARN: water.api.RequestServer.response404(RequestServer.java:491)
    water.api.RequestServer.getResource(RequestServer.java:725)
    water.api.RequestServer.serve(RequestServer.java:577)
    water.JettyHTTPD$H2oDefaultServlet.doGeneric(JettyHTTPD.java:617)
    water.JettyHTTPD$H2oDefaultServlet.doGet(JettyHTTPD.java:559)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
12-27 19:14:19.025 10.10.0.140:54321 20660 #27877-13 INFO: Method: GET , URI: /, route: , parms: {}
12-28 00:51:30.263 10.10.0.140:54321 20660 FJ-1-3 INFO: IOException during ACK, Connection timed out, t#10952 AB=[AB write 2nd /10.10.0.139:54321 TCP], waiting and retrying...
12-28 00:51:30.363 10.10.0.140:54321 20660 FJ-1-3 INFO: Cancelled remote task#10952 class water.fvec.RollupStats$Histo to /10.10.0.139:54321 has been cancelled by remote
12-28 00:51:30.774 10.10.0.140:54321 20660 #:54321-0 ERRR: java.io.IOException: Connection timed out
12-28 00:51:34.358 10.10.0.140:54321 20660 #:54321-1 ERRR: java.io.IOException: Connection timed out
12-28 00:51:34.358 10.10.0.140:54321 20660 #:54321-0 ERRR: java.io.IOException: Connection timed out
12-28 01:07:01.078 10.10.0.140:54321 20660 FJ-1-21 INFO: IOException during ACK, Connection timed out, t#10960 AB=[AB write 2nd /10.10.0.139:54321 TCP], waiting and retrying...
12-28 01:07:01.179 10.10.0.140:54321 20660 FJ-1-21 INFO: Cancelled remote task#10960 class water.fvec.RollupStats$Histo to /10.10.0.139:54321 has been cancelled by remote
12-28 01:22:24.726 10.10.0.140:54321 20660 FJ-0-3 INFO: IOException during RPC call: Connection timed out, AB=[AB write 2nd /10.10.0.85:54321 TCP], for task#482, waiting and retrying...
12-28 01:22:32.918 10.10.0.140:54321 20660 #:54321-1 ERRR: java.io.IOException: Connection timed out
12-28 01:37:52.470 10.10.0.140:54321 20660 FJ-0-3 INFO: IOException during RPC call: Connection timed out, AB=[AB write 2nd /10.10.0.87:54321 TCP], for task#306, waiting and retrying...
12-28 02:09:04.254 10.10.0.140:54321 20660 #:54321-2 ERRR: IO error on TCP port 54322: java.lang.ArrayIndexOutOfBoundsException: 2
12-28 02:09:24.793 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:09:44.795 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:10:24.799 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:11:24.807 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:12:24.813 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:13:24.821 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
12-28 02:14:24.828 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:

[ this repeats thousands of times... ]

01-04 23:02:25.073 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:03:25.081 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:04:25.085 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:05:25.093 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:06:25.103 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:07:25.112 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:08:25.120 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:09:25.128 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:10:25.136 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:11:25.144 10.10.0.140:54321 20660 FJ-123-3 WARN: got tcp with existing task #, FROM /10.10.0.139:54321 AB:
01-04 23:13:54.034 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.034 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.035 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2
01-04 23:13:54.036 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException: 2

[ this repeats millions of times... ]

01-05 01:00:05.139 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.139 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.139 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.139 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.139 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.140 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.140 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayIndexOutOfBoundsException
01-05 01:00:05.140 10.10.0.140:54321 20660 FJ-0-61 ERRR: java.lang.ArrayInde

The h2o.out stdout file grew to 18GB and filled up the disk.
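For anyone triaging a similarly oversized h2o.out, here is a minimal Python sketch that collapses the repeated entries into one row per distinct level and message, with a count and the first and last timestamps. It is not part of H2O. The line layout (`MM-DD HH:MM:SS.mmm host:port pid thread LEVEL: message`) is an assumption inferred from the excerpts above, and `LINE_RE` and `summarize` are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log excerpts in this issue:
#   MM-DD HH:MM:SS.mmm host:port pid thread LEVEL: message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<node>\S+) (?P<pid>\d+) (?P<thread>\S+) "
    r"(?P<level>[A-Z]+): (?P<msg>.*)$"
)


def summarize(lines):
    """Collapse repeated log entries.

    Returns a list of (count, level, message, first_ts, last_ts) tuples,
    most frequent first. Lines that do not match LINE_RE (stack-trace
    continuation lines, free text) are skipped.
    """
    counts = Counter()
    first_seen = {}
    last_seen = {}
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue
        key = (m["level"], m["msg"].strip())
        counts[key] += 1
        first_seen.setdefault(key, m["ts"])  # keep the earliest timestamp
        last_seen[key] = m["ts"]             # overwrite with the latest
    return [
        (n, level, msg, first_seen[(level, msg)], last_seen[(level, msg)])
        for (level, msg), n in counts.most_common()
    ]
```

Passing an open file object (for example `summarize(open("h2o.out", errors="replace"))`) reads the file one line at a time, so memory use depends on the number of distinct messages, not on the file size.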

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2523
Assignee: Tomas Nykodym
Reporter: Tom Kraljevic
State: Resolved
Fix Version: N/A
Attachments: N/A
Development PRs: N/A