manucorujo closed this issue 7 months ago
@manucorujo, thanks for reporting this. I see the error is in the recover-partitions statement. Quick question: is the CSV data in the external tables partitioned by the same columns specified in the SQL scripts, i.e., https://github.com/microsoft/lst-bench/blob/main/src/main/resources/scripts/tpcds/setup/spark/ddl-external-tables.sql ? Our internal tool generates the CSV data partitioned according to those columns, and that carried over into the framework's SQL script; if the data you generated is not partitioned, you can probably remove the partitioning spec from that script.
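To make the suggestion concrete, here is an illustrative sketch of what such an external-table DDL typically looks like; the column list and options below are assumptions for illustration, not the exact DDL from `ddl-external-tables.sql`:

```sql
-- Illustrative sketch only: columns, path, and options are assumptions,
-- not the actual DDL shipped with lst-bench.
CREATE TABLE external_tpcds.catalog_sales (
  cs_item_sk INT,
  cs_order_number INT,
  cs_sold_date_sk INT
)
USING csv
OPTIONS (path '/path/to/csv/catalog_sales/')
PARTITIONED BY (cs_sold_date_sk);  -- drop this clause if your CSV data is not partitioned
```

If the clause is dropped, the subsequent recover-partitions step would also become unnecessary for that table.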
Thanks! This solved our doubt. However, we are still unable to execute the full benchmark.
We are trying to execute the following predefined tasks: setup, setup_data_maintenance, init, build, single_user (the ones defined in the sample workload file). In the Databricks web UI we can see that it creates the external schema and all 24 TPC-DS tables in it. After this, it creates some tables with the suffixes "_000001", "_000002", "_000004", and "_000005". We assume these tables are used for data maintenance, but we don't know how.
After this, it creates the Delta tables (dropping them first if they exist). The tables are created correctly, and then it tries to populate them by copying the rows from the external tables with a query like this:
```sql
INSERT INTO
  spark_catalog.delta_tpcds.catalog_sales
SELECT
  *
FROM
  spark_catalog.external_tpcds.catalog_sales
```
The query executes successfully, but we suspect that the external tables are empty, because the result of the query is the following:
So, we wonder at what point the external tables should have been filled with data, because looking at the executions in Databricks, this never happens (we imagine it is our mistake and that we must trigger it in some phase or task, but we don't know where).
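A trivial check can confirm the suspicion before the copy step, using one of the external table names that appear in the queries above:

```sql
-- Sanity check: count rows in one of the external tables before populating
-- the Delta tables. If this returns 0, the INSERT INTO ... SELECT will
-- succeed but copy nothing, which matches the behavior described here.
SELECT COUNT(*) FROM spark_catalog.external_tpcds.catalog_sales;
```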
Also, once all the INSERT INTO statements into the Delta tables have been executed, it tries to run the SQL queries of the TPC-DS benchmark. First, it executes the following:
```sql
SELECT
  COUNT(*)
FROM
  spark_catalog.delta_tpcds.store_sales,
  spark_catalog.delta_tpcds.household_demographics,
  spark_catalog.delta_tpcds.time_dim,
  spark_catalog.delta_tpcds.store
WHERE
  ss_sold_time_sk = time_dim.t_time_sk
  AND ss_hdemo_sk = household_demographics.hd_demo_sk
  AND ss_store_sk = s_store_sk
  AND time_dim.t_hour = 8
  AND time_dim.t_minute >= 30
  AND household_demographics.hd_dep_count = 5
  AND store.s_store_name = 'ese'
ORDER BY
  COUNT(*)
LIMIT
  100
```
Databricks shows the execution as successful, but it does not return any rows, and the launcher script breaks with the following error (after which Databricks does not execute any more queries):
```
ERROR StatusLogger Unable to create Lookup for bundle
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.ResourceBundleLookup
    at java.base/java.lang.Class.asSubclass(Class.java:4102)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:135)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.NullConfiguration.<init>(NullConfiguration.java:32)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:74)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getDefault(ClassLoaderContextSelector.java:266)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:146)
    at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
```
The error is very long and contains many more lines, but we think they are not useful (if you need them, we can of course send them to you). We found this discussion in a Databricks forum, but we believe the error is related to the tables being empty, not to what is indicated in the forum. Also, an answer was given there, but the user who asked the question did not mark it as the solution.
We discovered that when using the default tasks, the execution drops the external tables and recreates them empty. As a result, the first SELECT query returns 0 rows, which is expected since the tables are empty. However, when executing only the 'single_user' task and creating the external tables beforehand, the query returns rows. We did not find clear guidance in the documentation on data generation, which left us uncertain about the appropriate approach, timing, and tools: it is unclear whether the LST tool generates the data, or whether it is our responsibility to generate it with our own tools (or the official TPC ones). We opted to generate the external tables and data using the official TPC tools (dsdgen for data generation), and then skipped the 'setup' and 'setup_data_maintenance' tasks to avoid flushing the external tables. We only executed the 'init', 'build', and 'single_user' tasks; the last one contains the SELECT queries.
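Roughly, the trimmed-down workload file we ended up with looks like the following. This is only a sketch pieced together from the sample workload file; the exact schema, phase ids, and template ids are assumptions and may differ from what lst-bench actually expects:

```yaml
# Sketch of a trimmed workload keeping only the phases that do not flush
# the externally generated data. All ids below are assumptions based on
# the sample workload file, not a documented schema.
id: tpcds_no_setup
phases:
- id: init
  sessions:
  - tasks:
    - template_id: init
- id: build
  sessions:
  - tasks:
    - template_id: build
- id: single_user
  sessions:
  - tasks:
    - template_id: single_user
```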
Using the Databricks web UI, we can see that the first SELECT query executes correctly, reading some data. However, the benchmark execution stops abruptly after the first query. The error message is extensive, but the following excerpt is particularly informative (it is extracted from the ERROR we reported in the previous message):
```
2024-01-11T11:51:40,949 ERROR [pool-2-thread-1] task.TaskExecutor: Exception executing statement: query96.sql_0, statement text: SELECT
COUNT(*)
FROM
spark_catalog.delta_tpcds.store_sales ${asof},
spark_catalog.delta_tpcds.household_demographics,
spark_catalog.delta_tpcds.time_dim,
spark_catalog.delta_tpcds.store
WHERE
ss_sold_time_sk = time_dim.t_time_sk
AND ss_hdemo_sk = household_demographics.hd_demo_sk
AND ss_store_sk = s_store_sk
AND time_dim.t_hour = 8
AND time_dim.t_minute >= 30
AND household_demographics.hd_dep_count = 5
AND store.s_store_name = 'ese'
ORDER BY
COUNT(*) LIMIT 100; error message: Query execution (0 retries) unsuccessful; stack trace: java.sql.SQLException: [Databricks][JDBCDriver](500540) Error caught in BackgroundFetcher. Foreground thread ID: 33. Background thread ID: 39. Error caught: null.
    at com.databricks.client.hivecommon.dataengine.BackgroundFetcher.run(Unknown Source)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
Caused by: com.databricks.client.support.exceptions.GeneralException: [Databricks][JDBCDriver](500540) Error caught in BackgroundFetcher. Foreground thread ID: 33. Background thread ID: 39. Error caught: null.
```
The reason for the failure is unclear, as Databricks marks the execution as successful; the ERROR message, however, suggests otherwise.
We solved the issue by changing the Java version: we had Java 17 installed (from Adoptium), and switching to Java 11 fixed it.
We are trying to execute the benchmark on a Databricks serverless SQL warehouse using the Databricks JDBC connector, and we are facing the following problem:
The error says that the path does not exist (as marked with the green arrow in the screenshot), but the other arrow shows that it actually does exist.
We think the problem comes from our configuration of the experiment_config.yaml file. The file contains the following parameters:

We are completing this file based on the information provided in the sample_experiment_config.yaml file and the parameter descriptions in the experiment_config.json schema file. One problem is that this JSON file describes the metadata parameter, but not all the sub-parameters it can contain or their meaning.