DOS attack system.register_partition() can get BoundedExecutor into an unrecoverable state

awishnick commented 4 years ago

I wrote some misbehaved client code that got our cluster into a bad state. @findepi suggested I document it here in case it helps. From what I can tell, https://github.com/airlift/airlift/pull/742 wouldn't help because it doesn't seem like the executor ever recovers.

Create a new partitioned Hive table backed by S3, 1 Presto coordinator with ~80 workers
Concurrently call system.register_partition() for ~7000 partitions
Have clients ignore the QUERY_QUEUE_FULL errors and retry immediately to DOS the coordinator
Observe the "BoundedExecutor is in a failed state" errors. At this point, stop the DOS attack.

Note that no new queries work, the coordinator returns "BoundedExecutor is in a failed state"

java.lang.IllegalStateException: BoundedExecutor is in a failed state
at com.google.common.base.Preconditions.checkState(Preconditions.java:507)
at io.airlift.concurrent.BoundedExecutor.execute(BoundedExecutor.java:55)
at io.prestosql.spi.VersionEmbedder.lambda$embedVersion$0(VersionEmbedder.java:33)
at io.prestosql.plugin.hive.HiveSplitManager$ErrorCodedExecutor.execute(HiveSplitManager.java:488)
at io.prestosql.plugin.hive.util.ResumableTasks.submit(ResumableTasks.java:31)
at io.prestosql.plugin.hive.BackgroundHiveSplitLoader.start(BackgroundHiveSplitLoader.java:192)
at io.prestosql.plugin.hive.HiveSplitManager.getSplits(HiveSplitManager.java:261)
at io.prestosql.plugin.base.classloader.ClassLoaderSafeConnectorSplitManager.getSplits(ClassLoaderSafeConnectorSplitManager.java:53)
at io.prestosql.split.SplitManager.getSplits(SplitManager.java:86)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitScanAndFilter(DistributedExecutionPlanner.java:189)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:172)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitTableScan(DistributedExecutionPlanner.java:149)
at io.prestosql.sql.planner.plan.TableScanNode.accept(TableScanNode.java:131)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitLimit(DistributedExecutionPlanner.java:360)
at io.prestosql.sql.planner.DistributedExecutionPlanner$Visitor.visitLimit(DistributedExecutionPlanner.java:149)
at io.prestosql.sql.planner.plan.LimitNode.accept(LimitNode.java:109)
at io.prestosql.sql.planner.DistributedExecutionPlanner.doPlan(DistributedExecutionPlanner.java:119)
at io.prestosql.sql.planner.DistributedExecutionPlanner.doPlan(DistributedExecutionPlanner.java:124)
at io.prestosql.sql.planner.DistributedExecutionPlanner.plan(DistributedExecutionPlanner.java:96)
at io.prestosql.execution.SqlQueryExecution.planDistribution(SqlQueryExecution.java:413)
at io.prestosql.execution.SqlQueryExecution.start(SqlQueryExecution.java:337)
at io.prestosql.execution.SqlQueryManager.createQuery(SqlQueryManager.java:237)
at io.prestosql.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:132)
at io.prestosql.$gen.Presto_332____20200507_223637_2.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)

electrum commented 4 years ago

BoundedExecutor gets into a failed state because the underlying executor refused task creation. This should only be possible due to some combination of a misconfiguration (sum of thread pool maximums are too large) and/or bugs (unbounded thread creation).

I recommend running with https://github.com/airlift/jvmkill as this will immediately kill the JVM if it cannot create a thread.

Note that this doesn't seem to have anything to do with register_partition specifically, as that operation seems to execute entirely within the calling thread. This is more likely somewhere in the coordinator during execution.

Can you show the config properties for the coordinator and Hive connector?

awishnick commented 4 years ago

I've attached my config. Thanks for the tip on jvmkill.

For what it's worth, I am also seeing this issue manifest as the client getting read timeouts when running queries, i.e. the coordinator never responds. I also see workers becoming unresponsive to the point that launcher stop does nothing. This is presto 333 fwiw.

Coordinator config:

coordinator=true
node-scheduler.include-coordinator=false
discovery-server.enabled=true

http-server.http.port=8080
discovery.uri=http://xxxx:8080

query.max-memory=3000GB
query.max-total-memory=18000GB
query.max-memory-per-node=12GB
query.max-total-memory-per-node=36GB

query.max-execution-time=20m
query.max-run-time=30m

optimizer.join-reordering-strategy=AUTOMATIC
join-distribution-type=AUTOMATIC

connector.name=hive-hadoop2
hive.metastore.uri=thrift://xxxxx:9083
hive.metastore-timeout=3m
hive.s3.aws-access-key=xxxxx
hive.s3.aws-secret-key=xxxxx
hive.s3.endpoint=xxxxx
hive.s3.ssl.enabled=false
hive.s3.path-style-access=true

hive.parquet.use-column-names=true
hive.orc.use-column-names=true

hive.allow-drop-table=true
hive.allow-rename-table=true
hive.allow-register-partition-procedure=true

hive.create-empty-bucket-files=false

hive.non-managed-table-writes-enabled=true

hive.max-partitions-per-writers=5000

awishnick commented 4 years ago

Ok, an update -- even with appropriate exponential backoff, if I hit the coordinator with ~3000 register_partition queries, it becomes permanently unresponsive. To the client this shows up as HTTP connection read timeouts. I also can't connect to the web UI. I also see the coordinator process at 100% CPU. While the coordinator is unresponsive, I see failures like this in server.log:

2020-05-08T17:30:42.608-0400    ERROR   http-client-scheduler-4711  com.google.common.util.concurrent.AbstractFuture    RuntimeException while executing runnable CallbackListener{io.prestosql.server.remotetask.SimpleHttpResponseHandler@7ef1e726} with executor io.airlift.concurrent.BoundedExecutor@27a6117c
java.lang.IllegalStateException: BoundedExecutor is in a failed state
    at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
    at io.airlift.concurrent.BoundedExecutor.execute(BoundedExecutor.java:55)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
    at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:738)
    at io.airlift.http.client.jetty.JettyResponseFuture.completed(JettyResponseFuture.java:84)
    at io.airlift.http.client.jetty.BufferingResponseListener.onComplete(BufferingResponseListener.java:90)
    at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:218)
    at org.eclipse.jetty.client.ResponseNotifier.notifyComplete(ResponseNotifier.java:210)
    at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:543)
    at org.eclipse.jetty.client.HttpReceiver.terminateResponse(HttpReceiver.java:523)
    at org.eclipse.jetty.client.HttpReceiver.responseSuccess(HttpReceiver.java:486)
    at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.messageComplete(HttpReceiverOverHTTP.java:326)
    at org.eclipse.jetty.http.HttpParser.handleContentMessage(HttpParser.java:580)
    at org.eclipse.jetty.http.HttpParser.parseContent(HttpParser.java:1697)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1526)
    at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.parse(HttpReceiverOverHTTP.java:200)
    at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:141)
    at org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:75)
    at org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:133)
    at org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:156)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
    at java.base/java.lang.Thread.run(Thread.java:834)

awishnick commented 4 years ago

From what I can tell I'm hitting up against the user process limit, which was 5k, but I bumped it to 15k and then 32k. During this test I've observed presto's process tree having > 8k processes. This is on a 12 core machine. Is that expected?

tooptoop4 commented 3 years ago

yes, go for max ulimit

erichwang commented 3 years ago

Do we happen to know what was the original exception that was thrown by the underlying ExecutorService? It should be preceded by an error log that looks like: "BoundedExecutor state corrupted due to underlying executor failure"

trinodb / trino

DOS attack system.register_partition() can get BoundedExecutor into an unrecoverable state #3674