prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0

Native CAST performance slower than Java #22346

Open aaneja opened 7 months ago

aaneja commented 7 months ago

Spawning a new issue for the comment on https://github.com/prestodb/presto/issues/22184#issuecomment-1996569271

Experiment with and without CAST

I was looking at possible causes for the latency difference observed between Native and Java clusters for TPCDS Q23 (SF 10K), and observed the following regarding the performance of the CAST operator.

On a Native cluster

Measure read speed of the integer column ss_quantity read as-is

presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT ss_quantity FROM store_sales;
                                                                                                                                                      Query Plan                                                                                   >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Fragment 1 [SOURCE]                                                                                                                                                                                                                               >
     CPU: 16.68m, Scheduled: 1.19h, Input: 28,799,864,615 rows (135.44GB); per task: avg.: 1,799,991,538.44 std.dev.: 1,135,109,560.01, Output: 28,799,864,615 rows (105.94GB), 16 tasks                                                           >
     Output layout: [ss_quantity]                                                                                                                                                                                                                  >
     Output partitioning: SINGLE []                                                                                                                                                                                                                >
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                                                                                                                                                                 >
     - TableScan[PlanNodeId 0][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf10000_parquet_v>
             CPU: 16.68m (100.00%), Scheduled: 1.19h (100.00%), Output: 28,799,864,615 rows (105.94GB)                                                                                                                                             >
             Input avg.: 56,249,735.58 rows, Input std.dev.: 661.25%                                                                                                                                                                               >
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}                                                                                                                                                                                   >
             ss_quantity := ss_quantity:int:10:REGULAR (1:41)                                                                                                                                                                                      >
             Input: 28,799,864,615 rows (135.44GB), Filtered: 0.00%                                                                                                                                                                                >
                                                                                                                                                                                                                                                   >
                                                                                                                                                                                                                                                   >
(1 row)

Query 20240314_043226_00018_kvcgs, FINISHED, 17 nodes
Splits: 7,771 total, 7,771 done (100.00%)
[Latency: client-side: 1:42, server-side: 1:42] [28.8B rows, 1.49GB] [283M rows/s, 14.9MB/s]

Compare this against the read speed when we are forced to CAST the column to a decimal

presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT CAST(ss_quantity AS decimal(10,0)) FROM store_sales;
                                                                                                                                                                        Query Plan                                                                 >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Fragment 1 [SOURCE]                                                                                                                                                                                                                               >
     CPU: 25.40m, Scheduled: 1.31h, Input: 28,799,864,615 rows (135.44GB); per task: avg.: 1,799,991,538.44 std.dev.: 1,500,572,958.27, Output: 28,799,864,615 rows (208.40GB), 16 tasks                                                           >
     Output layout: [expr]                                                                                                                                                                                                                         >
     Output partitioning: SINGLE []                                                                                                                                                                                                                >
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                                                                                                                                                                 >
     - ScanProject[PlanNodeId 0,1][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf100>
             CPU: 15.87m (62.49%), Scheduled: 16.41m (20.92%), Output: 28,799,864,615 rows (208.40GB)                                                                                                                                              >
             Input avg.: 56,249,735.58 rows, Input std.dev.: 729.65%                                                                                                                                                                               >
             expr := CAST(ss_quantity AS decimal(10,0)) (1:65)                                                                                                                                                                                     >
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}                                                                                                                                                                                   >
             ss_quantity := ss_quantity:int:10:REGULAR (1:64)                                                                                                                                                                                      >
             Input: 28,799,864,615 rows (135.44GB), Filtered: 0.00%                                                                                                                                                                                >
                                                                                                                                                                                                                                                   >
                                                                                                                                                                                                                                                   >
(1 row)

Query 20240314_043432_00019_kvcgs, FINISHED, 17 nodes
Splits: 7,770 total, 7,770 done (100.00%)
[Latency: client-side: 3:24, server-side: 3:23] [28.8B rows, 1.51GB] [142M rows/s, 7.61MB/s]

On a Java cluster

Measure read speed of the integer column ss_quantity read as-is

presto> use hive.tpcds_sf10000_parquet_varchar;
USE
presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT ss_quantity FROM store_sales;
                                                                                                                                                      Query Plan                                                                                   >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Fragment 1 [SOURCE]                                                                                                                                                                                                                               >
     CPU: 32.10m, Scheduled: 3.06h, Input: 28,799,864,615 rows (29.70GB); per task: avg.: 1,799,991,538.44 std.dev.: 158,762,556.69, Output: 28,799,864,615 rows (134.11GB), 16 tasks                                                              >
     Output layout: [ss_quantity]                                                                                                                                                                                                                  >
     Output partitioning: SINGLE []                                                                                                                                                                                                                >
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                                                                                                                                                                 >
     - TableScan[PlanNodeId 0][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf10000_parquet_v>
             CPU: 32.09m (100.00%), Scheduled: 4.51h (100.00%), Output: 28,799,864,615 rows (134.11GB)                                                                                                                                             >
             Input avg.: 3,714,194.56 rows, Input std.dev.: 19.14%                                                                                                                                                                                 >
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}                                                                                                                                                                                   >
             ss_quantity := ss_quantity:int:10:REGULAR (1:41)                                                                                                                                                                                      >
             Input: 28,799,864,615 rows (29.69GB), Filtered: 0.00%                                                                                                                                                                                 >
                                                                                                                                                                                                                                                   >
                                                                                                                                                                                                                                                   >
(1 row)

Query 20240314_045541_00011_epfyg, FINISHED, 17 nodes
Splits: 7,771 total, 7,771 done (100.00%)
[Latency: client-side: 3:41, server-side: 3:41] [28.8B rows, 29.7GB] [131M rows/s, 138MB/s]

Compare this against the read speed when we are forced to CAST the column to a decimal

presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT CAST(ss_quantity AS decimal(10,0)) FROM store_sales;
                                                                                                                                                                        Query Plan                                                                 >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Fragment 1 [SOURCE]                                                                                                                                                                                                                               >
     CPU: 33.14m, Scheduled: 1.91h, Input: 28,799,864,615 rows (29.71GB); per task: avg.: 1,799,991,538.44 std.dev.: 92,245,103.31, Output: 28,799,864,615 rows (241.40GB), 16 tasks                                                               >
     Output layout: [expr]                                                                                                                                                                                                                         >
     Output partitioning: SINGLE []                                                                                                                                                                                                                >
     Stage Execution Strategy: UNGROUPED_EXECUTION                                                                                                                                                                                                 >
     - ScanProject[PlanNodeId 0,1][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='Optional[tpcds_sf100>
             CPU: 33.11m (100.00%), Scheduled: 3.13h (100.00%), Output: 28,799,864,615 rows (241.41GB)                                                                                                                                             >
             Input avg.: 3,714,673.63 rows, Input std.dev.: 18.99%                                                                                                                                                                                 >
             expr := CAST(ss_quantity AS decimal(10,0)) (1:65)                                                                                                                                                                                     >
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}                                                                                                                                                                                   >
             ss_quantity := ss_quantity:int:10:REGULAR (1:64)                                                                                                                                                                                      >
             Input: 28,799,864,615 rows (29.69GB), Filtered: 0.00%                                                                                                                                                                                 >
                                                                                                                                                                                                                                                   >
                                                                                                                                                                                                                                                   >
(1 row)

Query 20240314_045950_00012_epfyg, FINISHED, 17 nodes
Splits: 7,770 total, 7,770 done (100.00%)
[Latency: client-side: 3:29, server-side: 3:28] [28.8B rows, 29.7GB] [138M rows/s, 146MB/s]

Possible cause(s)

aaneja commented 7 months ago

Full performance test - https://gist.github.com/aaneja/dc70f655695933b6ff11978120bbebab

karteekmurthys commented 7 months ago

I tried the scan with and without CAST on a Native cluster locally. The difference I noticed is that the PartitionedOutput operator's output size is almost 2x with CAST vs. without. In both cases PartitionedOutput is the most expensive operator because of the writes over the network; the difference is simply that the volume of data is larger when casting to decimal. The TableScan + FilterProject (which contains the CAST) takes about the same time as in the no-cast query.

[Screenshots: Native no-CAST vs. Native with-CAST operator stats]

Possible reason for the 2x output size: ss_quantity is an INTEGER (4 bytes) in Velox, while decimals are represented as either short decimals (8 bytes) or long decimals (16 bytes). When ss_quantity is cast to decimal(10,0) it becomes a short decimal, so the output size doubles.
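The arithmetic behind the doubling can be sketched in a few lines (names are mine, for illustration; the widths match the flat representations described above: a 4-byte INTEGER widened to an 8-byte short-decimal unscaled value):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative per-row widths: INTEGER is a 4-byte value, and a short
// decimal such as DECIMAL(10,0) is stored as an 8-byte unscaled value.
constexpr size_t kIntWidth = sizeof(int32_t);           // 4 bytes
constexpr size_t kShortDecimalWidth = sizeof(int64_t);  // 8 bytes

// Bytes shuffled by PartitionedOutput for `rows` values of a given width.
size_t payloadBytes(size_t rows, size_t width) {
  return rows * width;
}
```

For the same row count, the cast output is exactly twice the integer payload, which matches the ~105.94GB vs. ~208.40GB output sizes in the plans above.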

aditi-pandit commented 7 months ago

@karteekmurthys : Let's do a microbenchmark comparing CAST(INT -> DECIMAL) performance in Velox against DuckDB and Presto Java to validate the issue and analyze further.
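A minimal standalone timing sketch of the cast under test (this is my own harness for illustration, not Velox's ExpressionBenchmarkBuilder; for DECIMAL(10,0), i.e. scale 0, the cast reduces to a widening copy):

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time a scalar INT -> DECIMAL(10,0) conversion over a batch and return
// elapsed microseconds. With scale 0 the unscaled decimal value equals the
// integer, so the body is just a widening assignment.
double timeCastMicros(const std::vector<int32_t>& input,
                      std::vector<int64_t>& output) {
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < input.size(); ++i) {
    output[i] = static_cast<int64_t>(input[i]);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::micro>(end - start).count();
}
```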

karteekmurthys commented 7 months ago

Here is a benchmark run compared against DuckDB.

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================

cast##cast_decimal_as_bigint                              969.12us     1.03K
cast##cast_int_as_short_decimal                           911.29us     1.10K
cast##cast_int_as_long_decimal                            925.29us     1.08K
cast##cast_bigint_as_short_decimal                        906.91us     1.10K
cast##cast_bigint_as_long_decimal                         922.87us     1.08K

Update: the positive relative time implies Velox is slower than DuckDB. Here is how the relative time is computed:

    trialResults[actualTrials] = std::make_pair(
        max(0.0, double(nsecs.count()) / timeIterData.niter - globalBaseline),
        std::move(timeIterData.userCounters));

karteekmurthys commented 6 months ago

I ran presto_server under the CLion profiler and found that the majority of time is indeed spent in PartitionedOutput read/write. CRC computation is one of the major factors in the PartitionedOutput slowdown.

[Screenshot: profiler results, 2024-04-17]

I tried disabling the CRC computation on PrestoSerializedPage, and the total time dropped from 25 min to 20.40 min. For the query without CAST, however, disabling the CRC check made no measurable impact; the total time is still ~16 min.
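For reference, a checksum like the one being disabled is O(bytes) work over the serialized page, which is why the roughly 2x-larger CAST output would pay roughly 2x the CRC cost. A minimal bitwise CRC-32 sketch (standard IEEE reflected polynomial; the actual serializer would use a faster table- or hardware-accelerated variant, but the scaling with page size is the same):

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320). One pass over
// every byte of the payload, so cost grows linearly with serialized size.
uint32_t crc32(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
  }
  return ~crc;
}
```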

Prestissimo with CAST, no CRC check:

```
presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT CAST(ss_quantity AS decimal(10,0)) FROM store_sales;
 Fragment 1 [SOURCE]
     CPU: 20.40m, Scheduled: 48.42m, Input: 28,799,864,615 rows (135.44GB); per task: avg.: 28,799,864,615.00 std.dev.: 0.00, Output: 28,799,864,615 rows (208.40GB), 1 tasks
     Output layout: [expr]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     - ScanProject[PlanNodeId 0,1][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}
             CPU: 7.32m (35.90%), Scheduled: 13.66m (28.21%), Output: 28,799,864,615 rows (208.40GB)
             Input avg.: 1,799,991,538.44 rows, Input std.dev.: 387.30%
             expr := CAST(ss_quantity AS decimal(10,0)) (1:65)
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}
             ss_quantity := ss_quantity:int:10:REGULAR (1:64)
             Input: 28,799,864,615 rows (135.44GB), Filtered: 0.00%

(1 row)
```
Prestissimo without CAST, no CRC check:

```
presto:tpcds_sf10000_parquet_varchar> explain analyze SELECT ss_quantity FROM store_sales;
 Fragment 1 [SOURCE]
     CPU: 16.88m, Scheduled: 42.67m, Input: 28,799,864,615 rows (135.44GB); per task: avg.: 28,799,864,615.00 std.dev.: 0.00, Output: 28,799,864,615 rows (105.94GB), 1 tasks
     Output layout: [ss_quantity]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     - TableScan[PlanNodeId 0][TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpcds_sf10000_parquet_varchar, tableName=store_sales, analyzePartitionValues=Optional.empty}', layout='O
             CPU: 16.88m (100.00%), Scheduled: 42.67m (100.00%), Output: 28,799,864,615 rows (105.94GB)
             Input avg.: 1,799,991,538.44 rows, Input std.dev.: 387.30%
             LAYOUT: tpcds_sf10000_parquet_varchar.store_sales{}
             ss_quantity := ss_quantity:int:10:REGULAR (1:41)
             Input: 28,799,864,615 rows (135.44GB), Filtered: 0.00%

(1 row)
```
yingsu00 commented 6 months ago

The issue is real and can easily be observed on the internal perf dashboard. [Screenshot: perf dashboard, 2024-05-15]

The cost is mainly in the FilterProject operator when doing type casts. [Screenshots: operator stats]

My guess is that the perf regression has two causes:

  1. New memory allocation
  2. Cast implementation not being vectorized

I can look into how to improve it.
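For context, the vectorization point can be illustrated with a flat kernel (my sketch, not Velox code): for DECIMAL(10,0) the unscaled value equals the integer, so the ideal cast is a branch-free widening copy the compiler can auto-vectorize; per-row branching or fresh per-batch allocations would defeat that.

```cpp
#include <cstddef>
#include <cstdint>

// Flat, branch-free INT -> DECIMAL(10,0) kernel: with scale 0 the unscaled
// decimal value equals the integer, so each row is a widening copy. A loop
// like this auto-vectorizes; per-row dispatch or allocation would not.
void castIntToShortDecimal(const int32_t* in, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = static_cast<int64_t>(in[i]);
  }
}
```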

aaneja commented 4 months ago

Observed this with TPCDS Q74 on a run comparing perf between Java and Prestissimo clusters -

Java fragment 13 where we do a CAST(ss_net_paid AS double):

 - Project[PlanNodeId 3858][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20), $hashvalue_928:bigint]
                Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (50.09GB), cpu: 355,545,124,759.71, memory: 678,625,159.92, network: 12,553,302,851.47}
                CPU: 1.50m (1.73%), Scheduled: 2.49m (1.06%), Output: 534,235,692 rows (30.13GB)
                Input avg.: 4,173,716.34 rows, Input std.dev.: 39.27%
                ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)
                $hashvalue_928 := combine_hash(combine_hash(combine_hash(BIGINT'0', COALESCE($operator$hash_code(c_customer_id), BIGINT'0')), COALESCE($operator$hash_code(c_first_name), BIGINT'0')), COALESCE($operator$hash_code(c_last_name), BIGINT'0')) (14:30)

uses 1.50m CPU

vs same for Native -

- Project[PlanNodeId 3894][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20)]
                Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (45.70GB), cpu: 175,362,539,178.07, memory: 570,598,993.11, network: 7,724,777,633.00}
                CPU: 25.39m (31.75%), Scheduled: 4.17h (33.24%), Output: 534,235,692 rows (87.99GB)
                Input avg.: 4,173,716.34 rows, Input std.dev.: 387.30%
                ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)

uses 25.39m of CPU, roughly 17x the Java fragment's 1.50m

karteekmurthys commented 4 months ago

I ran the benchmark locally for the cast from varchar to double, comparing against DuckDB and measuring the relative time taken. We are several milliseconds slower than DuckDB as well.

cast_varchar_as_double##cast_valid                        246.79ms      4.05
cast_varchar_as_double##cast_valid_nan                     45.05ms     22.20
cast_varchar_as_double##cast_valid_infinity                53.54ms     18.68
cast_varchar_as_double##try_cast_invalid_nan               53.60ms     18.66
cast_varchar_as_double##try_cast_invalid_infini            53.46ms     18.71
cast_varchar_as_double##try_cast_space                     27.01ms     37.02

I further checked why DuckDB is fast. Its varchar-to-float cast uses a faster technique based on: https://johnnylee-sde.github.io/Fast-numeric-string-to-int/

I will try to introduce this technique in Velox and see if it improves the time. It is supposed to bring the digit-combining work down from O(N) to O(log N).
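The linked technique amounts to SWAR parsing: combine all the digit bytes with a few multiply-and-shift steps instead of one multiply-add per digit. A sketch of the standard 8-digit variant (the constants follow the widely used fast_float formulation, not Velox or DuckDB code; it assumes a little-endian host and that callers have already validated the 8 bytes as digits):

```cpp
#include <cstdint>
#include <cstring>

// Parse exactly 8 ASCII digits with SWAR: subtract '0' from every byte at
// once, then merge neighbors in log2(8) = 3 multiply-shift steps
// (pairs -> quads -> all eight) rather than 8 sequential digit*10+digit steps.
uint32_t parseEightDigits(const char* chars) {
  uint64_t val;
  std::memcpy(&val, chars, sizeof(val));
  val -= 0x3030303030303030ULL;   // ASCII digits -> numeric 0..9 per byte
  val = (val * 10) + (val >> 8);  // adjacent bytes -> two-digit numbers
  const uint64_t mask = 0x000000FF000000FFULL;
  const uint64_t mul1 = 0x000F424000000064ULL;  // 100 + (1000000 << 32)
  const uint64_t mul2 = 0x0000271000000001ULL;  // 1 + (10000 << 32)
  val = (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32;
  return static_cast<uint32_t>(val);
}
```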