h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

spark should use .persist() method on each query #25

Closed: jangorecki closed this issue 6 years ago

jangorecki commented 6 years ago

To match the behavior of the other tools, Spark should call .persist() on each query; otherwise it should be marked as cache=FALSE, as Impala and Presto were marked before. Also, to avoid the cost of writing to HDD, it should persist to memory only; by default, persist() uses both RAM and HDD. https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.persist
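
For illustration, here is a minimal sketch of what memory-only persistence per query could look like in PySpark. It is not the benchmark's actual script: the helper names `run_persisted` and `demo`, the toy data, and the column names `id1` and `v1` are assumptions made for this example. Only `DataFrame.persist`, `DataFrame.count`, `DataFrame.unpersist`, and `StorageLevel.MEMORY_ONLY` are taken from the PySpark API.

```python
import time


def run_persisted(query, storage_level):
    """Run `query` (a callable returning a Spark DataFrame), persist its
    result at `storage_level`, force materialization, and return
    (row_count, elapsed_seconds). Hypothetical helper for illustration."""
    start = time.perf_counter()
    ans = query().persist(storage_level)  # lazy: only marks the result for caching
    n_rows = ans.count()                  # action: computes and caches the result
    elapsed = time.perf_counter() - start
    ans.unpersist()                       # release the cache before the next query
    return n_rows, elapsed


def demo():
    """Example usage; requires pyspark. App name and data are placeholders."""
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("persist-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["id1", "v1"])
    # MEMORY_ONLY avoids spilling to disk, unlike the MEMORY_AND_DISK default.
    n_rows, secs = run_persisted(
        lambda: df.groupBy("id1").sum("v1"), StorageLevel.MEMORY_ONLY
    )
    print(n_rows, round(secs, 3))
    spark.stop()
```

Calling `demo()` in an environment with PySpark installed runs one grouped query this way. Note that `persist()` alone is lazy, so an action such as `count()` is needed before the cached result actually exists.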