microsoft / lst-bench

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating Log-Structured Tables (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg.
Apache License 2.0

Databricks execution error #216

Closed manucorujo closed 7 months ago

manucorujo commented 8 months ago

We are trying to execute the benchmark on a Databricks serverless SQL warehouse using the Databricks JDBC connector, and we are facing the following problem:

(screenshot: Databricks error message)

The error says that the path does not exist (the part we marked with a green arrow), but the other arrow shows that it actually does exist.

We think the problem comes from our configuration of the experiment_config.yaml file. The file contains the following parameters:

---
version: 1
id: test_1
repetitions: 1

metadata:
  system: spark
  system_version: 3.3.1
  table_format: delta
  table_format_version: 2.2.0
  scale_factor: 1
  mode: cow

parameter_values:
  external_catalog: spark_catalog
  external_database: external_tpcds
  external_table_format: csv
  external_data_path: 'dbfs:/user/hive/warehouse/external_tpcds.db/'
  external_options_suffix: ',header="true"'
  external_tblproperties_suffix: ''
  catalog: spark_catalog
  database: delta_tpcds
  table_format: delta
  scale_factor: 1
  data_path: 'dbfs:/user/hive/warehouse/delta_tpcds.db/'
  options_suffix: ''
  tblproperties_suffix: ''

We filled in this file based on the sample_experiment_config.yaml file and on the parameter descriptions in the experiment_config.json file. One problem is that this JSON file describes the metadata parameter, but not all the sub-parameters it can contain or what they mean.

jcamachor commented 8 months ago

@manucorujo , thanks for reporting this. I see the error is in the recover partitions statement. Quick question: is the CSV data in the external tables partitioned by the same columns specified in the SQL scripts, i.e., https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup/spark/ddl-external-tables.sql ? Our internal tool generates the CSV data partitioned according to those columns, and that partitioning carried over to the framework's SQL script; if the data you generated is not partitioned, you can probably remove the partitioning spec from that script.
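
For illustration, a minimal sketch (not the actual contents of ddl-external-tables.sql; the abbreviated column list and layout here are hypothetical, and names mirror the config above) of the difference between a partitioned and an unpartitioned external CSV table in Spark SQL:

-- Hypothetical, abbreviated DDL. If the CSV files are laid out in
-- per-partition directories (e.g. cs_sold_date_sk=2450815/), keep the
-- PARTITIONED BY clause and recover the partitions afterwards:
CREATE TABLE spark_catalog.external_tpcds.catalog_sales (
  cs_sold_time_sk INT,
  cs_item_sk INT,
  cs_quantity INT,
  cs_sold_date_sk INT
)
USING csv
OPTIONS (header "true")
PARTITIONED BY (cs_sold_date_sk)
LOCATION 'dbfs:/user/hive/warehouse/external_tpcds.db/catalog_sales/';

ALTER TABLE spark_catalog.external_tpcds.catalog_sales RECOVER PARTITIONS;

-- If the generated data is NOT partitioned on disk, drop both the
-- PARTITIONED BY clause and the RECOVER PARTITIONS statement so Spark
-- simply reads the flat files under the table location.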

manucorujo commented 8 months ago

Thanks! That solved our doubt. However, we are still not able to execute the full benchmark.

We are trying to execute the following predefined tasks: setup, setup_data_maintenance, init, build, single_user (the ones defined in the sample workload file). In the Databricks web UI we can see that it creates the external schema and all 24 TPC-DS tables in it. After this, it creates some additional tables with the suffixes "_000001", "_000002", "_000004" and "_000005". We assume these tables are used for data maintenance, but we don't know how.

After this, it creates the Delta tables (deleting them first if they exist). The tables are created correctly, and it then tries to populate them by copying the rows from the external tables with queries like this one:

INSERT INTO
  spark_catalog.delta_tpcds.catalog_sales
SELECT
  *
FROM
  spark_catalog.external_tpcds.catalog_sales

The query executes successfully, but we suspect that the external tables are empty, because the query result looks like this:

(screenshot: Databricks query result)

So we wonder at what point the external tables should have been filled with data, because looking at the executions in Databricks this never happens (we imagine it is our mistake and that we must specify it in some phase or task, but we don't know where).
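
For instance, a simple count along these lines (shown here for catalog_sales, but any of the external tables would do; this is a hypothetical check, not part of the benchmark scripts) would confirm whether the external tables were ever populated before the build phase copies them into the Delta tables:

-- If this returns 0, the INSERT INTO ... SELECT above copies nothing and
-- every downstream TPC-DS query runs over empty tables.
SELECT COUNT(*) FROM spark_catalog.external_tpcds.catalog_sales;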

Also, once all the INSERT INTO statements into the Delta tables have been executed, it tries to run the TPC-DS benchmark SQL queries. First, it executes the following:

SELECT
  COUNT(*)
FROM
  spark_catalog.delta_tpcds.store_sales,
  spark_catalog.delta_tpcds.household_demographics,
  spark_catalog.delta_tpcds.time_dim,
  spark_catalog.delta_tpcds.store
WHERE
  ss_sold_time_sk = time_dim.t_time_sk
  AND ss_hdemo_sk = household_demographics.hd_demo_sk
  AND ss_store_sk = s_store_sk
  AND time_dim.t_hour = 8
  AND time_dim.t_minute >= 30
  AND household_demographics.hd_dep_count = 5
  AND store.s_store_name = 'ese'
ORDER BY
  COUNT(*)
LIMIT
  100

Databricks shows the execution as successful, but it does not return any rows, and the launcher script also breaks with the following error (Databricks does not execute any more queries):

ERROR StatusLogger Unable to create Lookup for bundle java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.ResourceBundleLookup
        at java.base/java.lang.Class.asSubclass(Class.java:4102)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:135)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.NullConfiguration.<init>(NullConfiguration.java:32)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:74)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getDefault(ClassLoaderContextSelector.java:266)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:146)
        at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)

The error is very long and contains many more lines, but we think they are not useful (we can of course send them if you need them). We found this discussion in a Databricks forum, but we believe the error is related to the tables being empty rather than to what is described there. Also, an answer was given in that thread, but the user who asked the question did not mark it as accepted.

manucorujo commented 7 months ago

We discovered that when using the default tasks, the execution deletes the external tables and recreates them empty. As a result, the first SELECT query returns 0 rows, which is expected given that the tables are empty. However, when executing only the 'single_user' task and creating the external tables beforehand, the query returns rows.

We did not find clear guidance in the documentation on data generation, so we were unsure about the appropriate approach, timing, and tools to use. It is unclear whether LST-Bench generates the data itself or whether it is our responsibility to generate it with our own tools (or the official TPC ones).

We opted to generate the external tables and data using the official TPC tools (dsdgen for data generation), and then skipped the 'setup' and 'setup_data_maintenance' tasks to avoid flushing the external tables. We only executed the 'init', 'build' and 'single_user' tasks; the last one contains the SELECT queries.
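
For reference, a hypothetical, abbreviated sketch of pre-creating one external table over the generated data before running only 'init', 'build' and 'single_user' (the full column list comes from the TPC-DS schema; note that raw dsdgen output is pipe-delimited, so it either needs converting to CSV with headers to match the config above or the delimiter option must be adjusted):

-- Hypothetical example; path and options mirror experiment_config.yaml above,
-- and the column list is truncated for brevity.
CREATE TABLE IF NOT EXISTS spark_catalog.external_tpcds.store_sales (
  ss_sold_time_sk INT,
  ss_hdemo_sk INT,
  ss_store_sk INT,
  ss_sold_date_sk INT
)
USING csv
OPTIONS (header "true")
LOCATION 'dbfs:/user/hive/warehouse/external_tpcds.db/store_sales/';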

Using the Databricks web UI, we can see that the first SELECT query executes correctly and reads some data. However, the benchmark execution suddenly stops after the first query. The error message is extensive, but the following excerpt is particularly informative (it is taken from the ERROR we reported in the previous message):

2024-01-11T11:51:40,949 ERROR [pool-2-thread-1] task.TaskExecutor: Exception executing statement: query96.sql_0, statement text: SELECT
    COUNT(*)
FROM
    spark_catalog.delta_tpcds.store_sales ${asof},
    spark_catalog.delta_tpcds.household_demographics,
    spark_catalog.delta_tpcds.time_dim,
    spark_catalog.delta_tpcds.store
WHERE
    ss_sold_time_sk = time_dim.t_time_sk
    AND ss_hdemo_sk = household_demographics.hd_demo_sk
    AND ss_store_sk = s_store_sk
    AND time_dim.t_hour = 8
    AND time_dim.t_minute >= 30
    AND household_demographics.hd_dep_count = 5
    AND store.s_store_name = 'ese'
ORDER BY
    COUNT(*) LIMIT 100; error message: Query execution (0 retries) unsuccessful; stack trace: java.sql.SQLException: [Databricks][JDBCDriver](500540) Error caught in BackgroundFetcher. Foreground thread ID: 33. Background thread ID: 39. Error caught: null.
        at com.databricks.client.hivecommon.dataengine.BackgroundFetcher.run(Unknown Source)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
Caused by: com.databricks.client.support.exceptions.GeneralException: [Databricks][JDBCDriver](500540) Error caught in BackgroundFetcher. Foreground thread ID: 33. Background thread ID: 39. Error caught: null.

The reason for the failure is unclear, as Databricks marks the execution as successful. However, the ERROR message suggests otherwise.

manucorujo commented 7 months ago

We solved the issue by changing the Java version. We had Java 17 installed (from Adoptium); switching to Java 11 resolved the problem.