Long running job failed due to single node crash or long full gc

gjhkael commented 3 years ago

In our production environment we run many ETL jobs, some of which are large queries that consume a lot of resources. When a worker crashes, any query with a task running on that node fails immediately, and the client then resubmits it. This not only wastes cluster resources but can also easily trigger a cluster avalanche.

wenleix commented 3 years ago

cc @arhimondr who is leading Presto-on-Spark.

arhimondr commented 3 years ago

Thanks for the CC @wenleix !

@gjhkael Presto is not very reliable for long-running jobs because it lacks failure recovery. Internally we are working on Presto on Spark as a solution for long-running, high-memory Presto SQL jobs: https://prestodb.io/docs/current/installation/spark.html. It requires a Spark deployment to run. If you don't have a Spark deployment, you should try Presto Unlimited, which offers a map-reduce-like execution mode for Presto with limited failure recovery capabilities: https://prestodb.io/blog/2019/08/05/presto-unlimited-mpp-database-at-scale

prestodb/presto, issue #15928