Open gjhkael opened 3 years ago
cc @arhimondr who is leading Presto-on-Spark.
Thanks for the CC @wenleix !
@gjhkael Presto is not very reliable for long-running jobs due to its lack of failure recovery. Internally we are working on Presto on Spark as a solution for long-running, high-memory Presto SQL jobs: https://prestodb.io/docs/current/installation/spark.html. It requires a Spark deployment to run. If you don't have a Spark deployment, you should try Presto Unlimited, which offers a map-reduce-like execution mode for Presto with limited failure recovery capabilities: https://prestodb.io/blog/2019/08/05/presto-unlimited-mpp-database-at-scale
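For reference, a Presto-on-Spark query is submitted via `spark-submit` rather than through a Presto coordinator; a rough sketch based on the linked installation docs is below. The jar/tarball names, paths, and Spark resource settings are placeholders you would adapt to your deployment.

```shell
# Sketch of launching a Presto SQL file on a Spark cluster (per the
# prestodb.io Presto-on-Spark docs); file names and resources are examples.
spark-submit \
    --master yarn \
    --executor-memory 4G \
    --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
    presto-spark-launcher-*.jar \
    --package presto-spark-package-*.tar.gz \
    --config /presto/etc/config.properties \
    --catalogs /presto/etc/catalogs \
    --catalog hive \
    --schema default \
    --file query.sql
```

Because each query runs as its own Spark application, a crashed executor triggers Spark's task retry instead of failing the whole query, which is the failure-recovery behavior discussed in this thread.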
In our production environment, we have many ETL jobs, and some of them are big queries that consume a lot of resources. When a worker crashes, any query with a running task on that node fails immediately. After the failure, the client resubmits the query. This not only wastes cluster resources but can also easily cause a cluster avalanche.