verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

How to cache scramble tables in Spark? #362

Open hychen20 opened 5 years ago

pyongjoo commented 5 years ago

The standard caching statement [1] should work when prefixed with bypass. For example:

verdict.sql('bypass cache table schema.scramble_table')

Disclaimer: We have not tested this yet, so I am not 100% certain.

[1] https://docs.databricks.com/spark/latest/spark-sql/language-manual/cache-table.html

hychen20 commented 5 years ago

Sorry, it seems it does not work. I cached the lineitem scramble table as well as the verdictdbmeta table. I can see in the Spark UI that the tables are cached; however, TPC-H Q1 still takes the same amount of time as when the tables are not cached.

Here's my code:

  verdict.setDefaultSchema(schema) // tpch1g
  verdict.sql("bypass cache table lineitem")
  verdict.sql("bypass cache table orders")
  verdict.sql("bypass cache table verdictdbmeta.verdictdbmeta")
  verdict.sql("bypass cache table lineitem_scramble")
  verdict.sql("bypass cache table orders_scramble")
  val q_verdict = spark.sparkContext.getConf.get("spark.verdictdb.query") // Q1, Q6, or Q14
  val rs_verdict = verdict.sql(q_verdict)
  rs_verdict.print()
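
To compare runtimes with and without the cache, it may help to time the same query over several runs rather than once, since the first run can include warm-up cost. Below is a minimal sketch of such a timing helper. The names `TimingSketch`, `timeMs`, and `timeRuns` are hypothetical and not part of VerdictDB or Spark, and the workload in `main` is only a stand-in for `verdict.sql(q_verdict)`.

```scala
object TimingSketch {
  // Hypothetical helper: evaluates `body` once and returns
  // the result together with the elapsed wall-clock time in milliseconds.
  def timeMs[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  // Evaluates `body` `runs` times and returns the elapsed time of each run.
  def timeRuns[A](runs: Int)(body: => A): Seq[Double] =
    (1 to runs).map(_ => timeMs(body)._2)

  def main(args: Array[String]): Unit = {
    // Stand-in workload (assumption); in the snippet above this would be
    // something like verdict.sql(q_verdict).
    val timings = timeRuns(3) { (1 to 1000000).map(_.toLong).sum }
    timings.zipWithIndex.foreach { case (ms, i) =>
      println(f"run ${i + 1}: $ms%.1f ms")
    }
  }
}
```

Running this once before and once after the `bypass cache table ...` statements would give per-run timings that can be compared directly.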