Open awishnick opened 4 years ago
BoundedExecutor gets into a failed state because the underlying executor refused task creation. This should only be possible due to some combination of misconfiguration (the sum of the thread pool maximums is too large) and/or bugs (unbounded thread creation).
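To make the failure mode concrete, here is a toy model of it in Python. This is a simplified sketch, not airlift's actual source: the class name BoundedExecutorModel and the underlying_execute callable are made up for illustration, and only the two message strings are taken from this thread. It shows why the state is sticky: once the underlying executor refuses a job, the wrapper latches a failed flag and rejects every later task.

```python
import logging
import threading
from collections import deque

log = logging.getLogger("bounded_executor_model")


class BoundedExecutorModel:
    """Toy model (hypothetical): hands at most max_threads drain jobs to an underlying executor."""

    def __init__(self, underlying_execute, max_threads):
        # underlying_execute is any callable that accepts a zero-argument job.
        self._underlying_execute = underlying_execute
        self._max_threads = max_threads
        self._queue = deque()
        self._queue_size = 0
        self._failed = False
        self._lock = threading.Lock()

    def execute(self, task):
        with self._lock:
            if self._failed:
                raise RuntimeError("BoundedExecutor is in a failed state")
            self._queue.append(task)
            self._queue_size += 1
            need_worker = self._queue_size <= self._max_threads
        if need_worker:
            try:
                self._underlying_execute(self._drain)
            except Exception:
                # The task stays queued with no worker to run it, so the
                # bookkeeping is now wrong; latch the failure permanently.
                with self._lock:
                    self._failed = True
                log.error("BoundedExecutor state corrupted due to underlying executor failure")
                raise

    def _drain(self):
        while True:
            with self._lock:
                task = self._queue.popleft()
            try:
                task()
            except Exception:
                log.exception("task failed")
            with self._lock:
                previous = self._queue_size
                self._queue_size -= 1
            # Keep draining only while more tasks are queued than workers allowed.
            if previous <= self._max_threads:
                return
```

With an underlying executor that always refuses (standing in for a JVM that cannot create a thread), the first execute() call propagates the original error and every later call fails with "BoundedExecutor is in a failed state", which matches the pattern in the logs: one root-cause error followed by a stream of IllegalStateExceptions.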
I recommend running with https://github.com/airlift/jvmkill as this will immediately kill the JVM if it cannot create a thread.
Note that this doesn't seem to have anything to do with register_partition specifically, as that operation seems to execute entirely within the calling thread. This is more likely somewhere in the coordinator during execution.
Can you show the config properties for the coordinator and Hive connector?
I've attached my config. Thanks for the tip on jvmkill.
For what it's worth, I am also seeing this issue manifest as the client getting read timeouts when running queries, i.e. the coordinator never responds. I also see workers becoming unresponsive to the point that launcher stop does nothing. This is Presto 333, fwiw.
Coordinator config:
coordinator=true
node-scheduler.include-coordinator=false
discovery-server.enabled=true
http-server.http.port=8080
discovery.uri=http://xxxx:8080
query.max-memory=3000GB
query.max-total-memory=18000GB
query.max-memory-per-node=12GB
query.max-total-memory-per-node=36GB
query.max-execution-time=20m
query.max-run-time=30m
optimizer.join-reordering-strategy=AUTOMATIC
join-distribution-type=AUTOMATIC
Hive connector config:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://xxxxx:9083
hive.metastore-timeout=3m
hive.s3.aws-access-key=xxxxx
hive.s3.aws-secret-key=xxxxx
hive.s3.endpoint=xxxxx
hive.s3.ssl.enabled=false
hive.s3.path-style-access=true
hive.parquet.use-column-names=true
hive.orc.use-column-names=true
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.allow-register-partition-procedure=true
hive.create-empty-bucket-files=false
hive.non-managed-table-writes-enabled=true
hive.max-partitions-per-writers=5000
Ok, an update -- even with appropriate exponential backoff, if I hit the coordinator with ~3000 register_partition queries, it becomes permanently unresponsive. To the client this shows up as HTTP connection read timeouts. I also can't connect to the web UI. I also see the coordinator process at 100% CPU. While the coordinator is unresponsive, I see failures like this in server.log:
2020-05-08T17:30:42.608-0400 ERROR http-client-scheduler-4711 com.google.common.util.concurrent.AbstractFuture RuntimeException while executing runnable CallbackListener{io.prestosql.server.remotetask.SimpleHttpResponseHandler@7ef1e726} with executor io.airlift.concurrent.BoundedExecutor@27a6117c
java.lang.IllegalStateException: BoundedExecutor is in a failed state
at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
at io.airlift.concurrent.BoundedExecutor.execute(BoundedExecutor.java:55)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:738)
at io.airlift.http.client.jetty.JettyResponseFuture.completed(JettyResponseFuture.java:84)
at io.airlift.http.client.jetty.BufferingResponseListener.onComplete(BufferingResponseListener.java:90)
at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:218)
at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:210)
at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:543)
at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:523)
at org.eclipse.jetty.client.HttpReceiver.responseSuccess(HttpReceiver.java:486)
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.messageComplete(HttpReceiverOverHTTP.java:326)
at org.eclipse.jetty.http.HttpParser.handleContentMessage(HttpParser.java:580)
at org.eclipse.jetty.http.HttpParser.parseContent(HttpParser.java:1697)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1526)
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.parse(HttpReceiverOverHTTP.java:200)
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:141)
at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:75)
at org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:133)
at org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:156)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.base/java.lang.Thread.run(Thread.java:834)
From what I can tell, I'm hitting the user process limit, which was 5k; I bumped it to 15k and then 32k. During this test I've observed Presto's process tree having more than 8k processes (on Linux, threads count toward this limit). This is on a 12-core machine. Is that expected?
yes, go for max ulimit
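For anyone wanting to compare the thread count against the limit while reproducing this, a small Linux-only sketch follows. It assumes a standard /proc layout and reads the "Threads:" field of /proc/<pid>/status and the "Max processes" row of /proc/<pid>/limits; the function names are made up for illustration. Keep in mind that the limit applies per user, so other processes running as the same user count toward it too.

```python
def parse_thread_count(status_text):
    """Return the integer from the 'Threads:' line of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def parse_max_processes(limits_text):
    """Return (soft, hard) from the 'Max processes' row of /proc/<pid>/limits text.

    Values are returned as strings because either one may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft, hard = line.split()[2:4]
            return soft, hard
    raise ValueError("no 'Max processes' row found")


def report(pid):
    """Read both /proc files for one PID and format a one-line summary."""
    with open(f"/proc/{pid}/status") as f:
        threads = parse_thread_count(f.read())
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_processes(f.read())
    return f"pid={pid} threads={threads} max_processes_soft={soft} hard={hard}"
```

Calling report() with the coordinator's PID every few seconds during the test should show whether the thread count keeps climbing toward the soft limit before the first BoundedExecutor error appears.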
Do we happen to know what was the original exception that was thrown by the underlying ExecutorService? It should be preceded by an error log that looks like: "BoundedExecutor state corrupted due to underlying executor failure"
I wrote some misbehaved client code that got our cluster into a bad state. @findepi suggested I document it here in case it helps. From what I can tell, https://github.com/airlift/airlift/pull/742 wouldn't help because it doesn't seem like the executor ever recovers.
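For reference, a sketch of what a better-behaved client might look like. Backoff alone only spaces out retries; it does not limit how many queries are in flight at once, so this version also caps concurrency. Everything here is an assumption for illustration: run_query is a hypothetical callable that submits one statement and raises on failure, and the default limits are arbitrary. Whether this would have avoided the coordinator problem described in this thread is not established.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def backoff_delays(base=0.5, cap=30.0, attempts=5):
    """Exponentially growing delays in seconds, each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_backoff(fn, delays, sleep=time.sleep):
    """Call fn(); after each failure sleep for a jittered delay and retry."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
    # Final attempt: let any exception propagate to the caller.
    return fn()


def run_all(statements, run_query, max_in_flight=8, sleep=time.sleep):
    """Run every statement with at most max_in_flight concurrent queries."""
    def one(statement):
        return call_with_backoff(lambda: run_query(statement), backoff_delays(), sleep)

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(one, statements))
```

The max_in_flight cap is the part that matters for a batch of ~3000 register_partition calls: the pool never has more than that many requests open against the coordinator, no matter how many statements are queued.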