microsoft / lst-bench

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating Log-Structured Tables (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg.
Apache License 2.0

Snowflake support #310

Closed agr17 closed 3 months ago

agr17 commented 3 months ago

This is a proposal to add Snowflake support to LST-Bench. These are the main changes:

With the current content of this PR, you can run all workloads on Snowflake. For the time being, I have not added documentation (although sample YAML files are available) or tests; I would prefer to wait for your feedback and recommendations first.

Fix #265

agr17 commented 3 months ago

@microsoft-github-policy-service agree company="Universidade da Coruña, A Coruña, Spain"

jcamachor commented 3 months ago

Thanks, @agr17! The structure looks good to me and is consistent with the implementation for other engines such as Spark and Trino. I think you currently evaluate on native tables. Do you plan to add support for Iceberg tables as well? I think the syntax for that would be CREATE ICEBERG TABLE, so it might be easy to do by adding a variable to the build SQL scripts and then passing a parameter value: empty for native tables and ICEBERG for Iceberg tables. Also, a small nit: please update the README.md file to reflect the new profile in the pom.xml file.
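A minimal sketch of the single-variable idea suggested above, written in Python for illustration only. The variable name `table_type`, the database name, and the one-column table are hypothetical and not taken from the LST-Bench scripts; the substitution is done with the standard-library `string.Template` rather than with LST-Bench's own templating.

```python
from string import Template

# Hypothetical build statement: `table_type`, `database`, and the column list
# are placeholders chosen for this example, not names from the PR.
CREATE_TEMPLATE = Template(
    "CREATE ${table_type} TABLE ${database}.call_center (cc_call_center_sk INT)"
)


def render(table_type: str, database: str = "lst_db") -> str:
    """Substitute the variables and collapse the gap left by an empty value."""
    sql = CREATE_TEMPLATE.substitute(table_type=table_type, database=database)
    # An empty table_type leaves two consecutive spaces; normalize whitespace.
    return " ".join(sql.split())


if __name__ == "__main__":
    print(render(""))         # native table
    print(render("ICEBERG"))  # Iceberg table
```

With an empty value the statement degrades to a plain CREATE TABLE, which is what makes a single parameter attractive when the rest of the statement is identical for both table kinds.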

agr17 commented 3 months ago

Hi @jcamachor, thanks for your feedback! Yes, I was planning to add support for Iceberg tables; my idea was to add it in a separate issue/PR.

I have already implemented it, but changing a single parameter is not enough. The data types available in Iceberg tables differ from the native Snowflake types (see https://docs.snowflake.com/en/user-guide/tables-iceberg-data-types). So I created a build_iceberg folder with the necessary SQL statements and added it to library.yaml as a task_template called build_iceberg. Two additional parameters are required: external_volume, which Snowflake needs in order to create Iceberg tables, and base_location, which specifies the folder on the external_volume where the table should be created.

Iceberg table support is available in commit https://github.com/microsoft/lst-bench/pull/310/commits/e01a7962b4934e33601fd4c12d6595e6fe2dfaa5 (sorry for the commit name, it was a mistake). It contains the build_iceberg folder, the corresponding tasks in library.yaml, and the new parameters for the Iceberg tables (exvol and base_location) in sample_experiment_config.yaml. This is a preliminary implementation; if you prefer a different approach that is more in line with the rest of the project, I'm open to any changes.
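To illustrate why one parameter is not enough, here is a rough Python sketch that assembles both kinds of statement. It assumes Snowflake itself acts as the Iceberg catalog (hence `CATALOG = 'SNOWFLAKE'`); the function name, table name, and column types are hypothetical placeholders, not a verified type mapping, so consult the linked Snowflake page for the actual types.

```python
from typing import Dict, Optional


def create_table_sql(
    table: str,
    columns: Dict[str, str],
    iceberg: bool = False,
    external_volume: Optional[str] = None,
    base_location: Optional[str] = None,
) -> str:
    """Build a CREATE statement for a native or an Iceberg table (illustrative)."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    if not iceberg:
        return f"CREATE TABLE {table} ({cols})"
    # Iceberg tables need storage information that native tables do not.
    if not external_volume or not base_location:
        raise ValueError("Iceberg tables need external_volume and base_location")
    return (
        f"CREATE ICEBERG TABLE {table} ({cols}) "
        f"CATALOG = 'SNOWFLAKE' "
        f"EXTERNAL_VOLUME = '{external_volume}' "
        f"BASE_LOCATION = '{base_location}'"
    )


if __name__ == "__main__":
    # Column types are placeholders; the two variants may need different types.
    native_cols = {"cc_call_center_sk": "INT", "cc_name": "VARCHAR(50)"}
    iceberg_cols = {"cc_call_center_sk": "INT", "cc_name": "STRING"}
    print(create_table_sql("call_center", native_cols))
    print(create_table_sql("call_center", iceberg_cols, iceberg=True,
                           external_volume="exvol", base_location="lst_bench/"))
```

Because both the column definitions and the trailing clauses can change, separate SQL files per table kind are simpler to maintain than a single heavily parameterized template.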

jcamachor commented 3 months ago

@agr17, this looks good, thanks! (Given the current LST-Bench framework, I do not think there is a better approach.) Since the SQL for the second step (inserting into the tables) is the same for both native and Iceberg tables, you might consider using a single copy of that step to reduce the duplication.

agr17 commented 3 months ago

Hi @jcamachor. I have removed the redundancy by using a single build folder with two subfolders, one for native tables and one for Iceberg tables. The insert queries are shared by both and live in the main build folder.
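The resulting layout can be pictured with a small Python sketch that lists the scripts a build would run for each table kind. The folder and file names below (`build`, `native`, `iceberg`, `1_create_tables.sql`, `2_insert_data.sql`) are hypothetical stand-ins and may not match the names used in the repository.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical root of the build scripts; the real path in the repo may differ.
BUILD_ROOT = PurePosixPath("build")


def build_scripts(table_format: str) -> List[str]:
    """Return the create script for the given format plus the shared insert script."""
    if table_format not in ("native", "iceberg"):
        raise ValueError(f"unknown table format: {table_format}")
    # The create step is format specific and lives in a subfolder.
    create = BUILD_ROOT / table_format / "1_create_tables.sql"
    # The insert step is identical for both formats, so only one copy exists.
    insert = BUILD_ROOT / "2_insert_data.sql"
    return [str(create), str(insert)]


if __name__ == "__main__":
    print(build_scripts("native"))
    print(build_scripts("iceberg"))
```

Only the first entry changes between the two formats, which reflects the point of the refactoring: the insert step is kept in one place.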

jcamachor commented 3 months ago

Merged to main. Thanks for your contribution, @agr17!