prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Long-running jobs fail due to a single node crash or a long full GC #15928

Open gjhkael opened 3 years ago

gjhkael commented 3 years ago

In our production environment we run many ETL jobs, some of which are large queries that consume a lot of resources. When a worker crashes, any query with a task running on that node fails immediately. The client then resubmits the failed query. This not only wastes cluster resources but can also easily trigger a cluster avalanche.
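
To illustrate the resubmission pattern described above, here is a minimal sketch of a client-side retry that spaces out resubmissions with capped exponential backoff and random jitter, so that many failed queries are not all resubmitted at the same moment. It is not taken from Presto or from any Presto client: `submit_query`, `QueryFailed`, and all parameter names are hypothetical, and it does not recover any work already done by the failed query.

```python
import random
import time


class QueryFailed(Exception):
    """Hypothetical error raised when a query fails, e.g. after a worker crash."""


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one capped, jittered exponential delay (in seconds) per retry."""
    for attempt in range(max_retries):
        # Exponential growth, capped, then scaled by a random factor in [0, 1).
        yield min(cap, base * 2 ** attempt) * rng()


def run_with_retry(submit_query, sql, max_retries=3, sleep=time.sleep, **kwargs):
    """Call submit_query(sql), retrying on QueryFailed with backoff.

    submit_query is a hypothetical callable supplied by the caller; extra
    keyword arguments (base, cap, rng) are passed on to backoff_delays.
    """
    delays = backoff_delays(max_retries, **kwargs)
    while True:
        try:
            return submit_query(sql)
        except QueryFailed:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the original failure
            sleep(delay)
```

A caller would wrap its own submission function, for example `run_with_retry(my_submit, "SELECT ...", max_retries=5)`. The jitter only reduces the chance that simultaneous resubmissions pile onto an already stressed cluster; each retry still reruns the whole query.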

wenleix commented 3 years ago

cc @arhimondr who is leading Presto-on-Spark.

arhimondr commented 3 years ago

Thanks for the CC, @wenleix!

@gjhkael Presto is not very reliable for long-running jobs because it lacks failure recovery. Internally, we are working on Presto on Spark as a solution for long-running, high-memory Presto SQL jobs: https://prestodb.io/docs/current/installation/spark.html. It requires a Spark deployment to run. If you don't have a Spark deployment, you should try Presto Unlimited, which offers a MapReduce-like execution mode for Presto with limited failure recovery capabilities: https://prestodb.io/blog/2019/08/05/presto-unlimited-mpp-database-at-scale