microsoft / lst-bench

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating Log-Structured Tables (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg.
Apache License 2.0

Support for the Snowflake Cloud Data Platform #265

Open agr17 opened 2 months ago

agr17 commented 2 months ago

Introduction

Hello everyone! In our project, called CLADE, we are analysing and comparing the features of current data lakehouse platforms, specifically using Databricks and Snowflake. Databricks uses Delta as its default table format and Snowflake has recently added support for Iceberg Tables.

While researching these table formats, we discovered LST-Bench and would like to use it to measure the efficiency of Delta in Databricks and Iceberg in Snowflake. Snowflake requires platform-specific SQL scripts and YAML files in the run folder. We would be glad to contribute these and add Snowflake support to the project.

New run configuration

We suggest a new run folder for Snowflake with its own config folder containing the sample YAML files and a scripts folder containing the necessary SQL statements. Connecting to Snowflake only requires adding the corresponding JDBC driver to pom.xml.
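As a rough illustration of the connection step, the sketch below assembles a Snowflake JDBC URL and connection properties. The class name, method names, and all placeholder values (account identifier, user, warehouse, database) are assumptions made for this example and are not part of LST-Bench. The `connect` method only works once the Snowflake JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

/** Hypothetical helper for illustration; not part of LST-Bench. */
final class SnowflakeConnectionSketch {

    // Builds the JDBC URL for an account identifier such as "myorg-myaccount".
    static String buildUrl(String accountIdentifier) {
        return "jdbc:snowflake://" + accountIdentifier + ".snowflakecomputing.com/";
    }

    // Collects the session parameters passed to the driver as connection properties.
    static Properties buildProperties(String user, String password,
                                      String warehouse, String database, String schema) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("warehouse", warehouse);
        props.setProperty("db", database);
        props.setProperty("schema", schema);
        return props;
    }

    // Opens a connection; requires the Snowflake JDBC driver at runtime.
    static Connection connect(String accountIdentifier, Properties props) throws SQLException {
        return DriverManager.getConnection(buildUrl(accountIdentifier), props);
    }

    public static void main(String[] args) {
        // Placeholder values only; nothing is sent over the network here.
        Properties props = buildProperties("bench_user", "change-me", "BENCH_WH", "BENCH_DB", "PUBLIC");
        System.out.println(buildUrl("myorg-myaccount"));
        System.out.println(props.getProperty("warehouse"));
    }
}
```

In an actual run configuration, these values would presumably come from the YAML connection file rather than being hard-coded.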

The scripts can offer two build options: one that uses Snowflake's default tables and one that uses the new Iceberg tables. The remaining tasks will be adapted to Snowflake syntax, for example by using Stages for data loading.
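To make the two build options more concrete, here is a minimal sketch that renders the DDL for either table type, plus a stage-based load statement. The class, the enum, and the table, volume, and stage names are hypothetical. In the proposed layout these statements would live in the SQL scripts folder, and the exact options (file format, base location) are assumptions that would need to match the actual data.

```java
/** Hypothetical helper for illustration; not part of LST-Bench. */
final class SnowflakeBuildSketch {

    enum TableFormat { NATIVE, ICEBERG }

    // Renders CREATE TABLE for a Snowflake default table or a Snowflake-managed Iceberg table.
    static String createTable(TableFormat format, String table, String columns, String externalVolume) {
        if (format == TableFormat.ICEBERG) {
            return "CREATE ICEBERG TABLE " + table + " (" + columns + ")"
                    + " CATALOG = 'SNOWFLAKE'"
                    + " EXTERNAL_VOLUME = '" + externalVolume + "'"
                    + " BASE_LOCATION = '" + table + "'";
        }
        // The external volume is not used for default tables.
        return "CREATE TABLE " + table + " (" + columns + ")";
    }

    // Renders a load from a named stage; assumes the staged files are Parquet.
    static String copyFromStage(String table, String stagePath) {
        return "COPY INTO " + table + " FROM @" + stagePath
                + " FILE_FORMAT = (TYPE = PARQUET)"
                + " MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE";
    }

    public static void main(String[] args) {
        String columns = "id INT, name STRING";
        System.out.println(createTable(TableFormat.NATIVE, "customer", columns, null));
        System.out.println(createTable(TableFormat.ICEBERG, "customer", columns, "bench_volume"));
        System.out.println(copyFromStage("customer", "bench_stage/customer/"));
    }
}
```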

Limitations

It is not possible to run the Optimise task on Snowflake-managed Iceberg tables. The REFRESH function is only available for externally managed tables, for example those cataloged in AWS Glue (see documentation). For Snowflake-managed Iceberg tables, the metadata refresh is coordinated internally by Snowflake. If you try to run REFRESH on such a table, the following error is displayed:

ALTER command failed. The provided table must be an Iceberg table with an external catalog integration to perform the command ICEBERG_TABLE_REFRESH. The type of table ICEBERG_TABLE is MANAGED

Therefore, we can only run the longevity, concurrency and time travel workloads. In the concurrency workload, the Optimise functions can be replaced by data maintenance functions.
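One way to guard against the REFRESH restriction described above is to issue the statement only for tables with an external catalog. This is only a sketch: the class, enum, and method names are hypothetical, and how LST-Bench would actually skip or replace the task is an open design question.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of LST-Bench. */
final class IcebergRefreshSketch {

    enum CatalogType { MANAGED, EXTERNAL }

    // Returns a REFRESH statement only for externally cataloged Iceberg tables;
    // Snowflake-managed tables yield an empty result so the caller can skip the step.
    static Optional<String> refreshStatement(String table, CatalogType catalog) {
        if (catalog == CatalogType.EXTERNAL) {
            return Optional.of("ALTER ICEBERG TABLE " + table + " REFRESH");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(refreshStatement("customer", CatalogType.EXTERNAL).orElse("(skipped)"));
        System.out.println(refreshStatement("customer", CatalogType.MANAGED).orElse("(skipped)"));
    }
}
```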


Finally, we are happy to develop Snowflake support along the lines described above, and we welcome your feedback.

jcamachor commented 1 month ago

@agr17, this is a very interesting proposal; we'd like to include support for other popular engines in LST-Bench. Please let us know if you hit any blockers. Looking forward to your contribution!