spark should use .persist() method on each query

h2oai / db-benchmark

reproducible benchmark of database-like ops

https://h2oai.github.io/db-benchmark

Mozilla Public License 2.0


Closed jangorecki closed 6 years ago

jangorecki commented 6 years ago

To match the behavior of the other tools, Spark should call .persist() on the result of each query. Otherwise it should be marked as cache=FALSE, as Impala and Presto were marked before. Also, so that timings do not suffer from writes to HDD, it should persist to memory only; by default persist() uses both RAM and HDD. https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.persist
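
A minimal sketch of what this request could look like in PySpark, not the benchmark's actual code. The helper name `persist_and_materialize`, the `demo` function, and the toy `v1`/`id1` data are hypothetical. Because `persist()` is lazy, the sketch follows it with `count()` as an action so that the cache is actually populated.

```python
def persist_and_materialize(df, storage_level):
    """Mark df for caching at storage_level, then force it to be computed.

    Hypothetical helper: `df` is expected to be a pyspark.sql.DataFrame and
    `storage_level` a pyspark.StorageLevel such as StorageLevel.MEMORY_ONLY.
    """
    df = df.persist(storage_level)  # lazy: only records the storage level
    df.count()                      # action: triggers computation and caching
    return df


def demo():
    # Imports live here so the helper above can be defined without Spark.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("persist-sketch")  # hypothetical app name
        .getOrCreate()
    )

    # Toy input standing in for a benchmark table (assumption).
    x = spark.range(1_000_000).withColumnRenamed("id", "v1")
    x = persist_and_materialize(x, StorageLevel.MEMORY_ONLY)

    # One "query": a grouped sum, persisted to memory only.
    ans = x.groupBy((x.v1 % 10).alias("id1")).sum("v1")
    ans = persist_and_materialize(ans, StorageLevel.MEMORY_ONLY)
    print(ans.count())

    # Release cached data before the next query.
    ans.unpersist()
    x.unpersist()
    spark.stop()
```

Calling `demo()` requires a local Spark installation. Note that with `StorageLevel.MEMORY_ONLY`, partitions that do not fit in memory are recomputed when needed rather than spilled to disk, which is the trade-off for avoiding HDD writes.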