aaneja opened this issue 7 months ago
Full performance test - https://gist.github.com/aaneja/dc70f655695933b6ff11978120bbebab
I tried the scan with and without the CAST on a Native cluster locally. The difference I noticed is that the PartitionedOutput operator's output size is almost 2x with the CAST vs. without it. In both cases this is the most expensive operator because of the writes over the network; it is just that the volume of data is larger when casting to decimal. The TableScan + FilterProject (which contains the CAST) takes about the same time as in the no-cast query.
| Native No CAST | Native With CAST |
|---|---|
| *(screenshot)* | *(screenshot)* |
Possible reason for the 2x output size: `ss_quantity` is an `INTEGER` (4 bytes) in Velox, while the decimal variants are short decimals (8 bytes) and long decimals (16 bytes). When `ss_quantity` is cast to `DECIMAL(10,0)` it becomes a short decimal, so the output size doubles.
@karteekmurthys: Let's do a microbenchmark comparing CAST(INT -> DECIMAL) performance in Velox against DuckDB and Presto Java to validate the issue and analyze further.
Here is a benchmark run compared against DuckDB:
```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp      relative  time/iter  iters/s
============================================================================
cast##cast_decimal_as_bigint                              969.12us    1.03K
cast##cast_int_as_short_decimal                           911.29us    1.10K
cast##cast_int_as_long_decimal                            925.29us    1.08K
cast##cast_bigint_as_short_decimal                        906.91us    1.10K
cast##cast_bigint_as_long_decimal                         922.87us    1.08K
```
Update: a positive relative time implies Velox is slower than DuckDB. Here is how the relative time is computed:

```cpp
trialResults[actualTrials] = std::make_pair(
    max(0.0, double(nsecs.count()) / timeIterData.niter - globalBaseline),
    std::move(timeIterData.userCounters));
```
I ran presto_server under the CLion profiler and found that the majority of the time is indeed spent in PartitionedOutput reads/writes. CRC computation is one of the major contributors to the slowdown in the PartitionedOutput operator.
I tried disabling the CRC computation on PrestoSerializedPage and the total time dropped from 25 min to 20.40 min. For the query without the CAST, however, disabling the CRC check made no difference; the total time is still ~16 min.
The issue is real and can easily be observed on the internal perf dashboard. The cost is mainly in the FilterProject operator when doing type casts like the ones discussed here. My guess is that the perf regression is two-fold. I can look into how to improve it.
Observed this with TPC-DS Q74 on a run comparing perf between Java and Prestissimo clusters. Java fragment 13, where we do a CAST(ss_net_paid AS double):
```
- Project[PlanNodeId 3858][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20), $hashvalue_928:bigint]
        Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (50.09GB), cpu: 355,545,124,759.71, memory: 678,625,159.92, network: 12,553,302,851.47}
        CPU: 1.50m (1.73%), Scheduled: 2.49m (1.06%), Output: 534,235,692 rows (30.13GB)
        Input avg.: 4,173,716.34 rows, Input std.dev.: 39.27%
        ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)
        $hashvalue_928 := combine_hash(combine_hash(combine_hash(BIGINT'0', COALESCE($operator$hash_code(c_customer_id), BIGINT'0')), COALESCE($operator$hash_code(c_first_name), BIGINT'0')), COALESCE($operator$hash_code(c_last_name), BIGINT'0')) (14:30)
```
This fragment uses 1.50m of CPU, vs. the same Project on Native:
```
- Project[PlanNodeId 3894][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20)]
        Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (45.70GB), cpu: 175,362,539,178.07, memory: 570,598,993.11, network: 7,724,777,633.00}
        CPU: 25.39m (31.75%), Scheduled: 4.17h (33.24%), Output: 534,235,692 rows (87.99GB)
        Input avg.: 4,173,716.34 rows, Input std.dev.: 387.30%
        ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)
```
This fragment uses 25.39m of CPU.
I ran the benchmark locally for the cast from varchar to double, comparing against DuckDB and measuring the relative time taken. We are several milliseconds slower than DuckDB here as well:
```
cast_varchar_as_double##cast_valid                        246.79ms     4.05
cast_varchar_as_double##cast_valid_nan                     45.05ms    22.20
cast_varchar_as_double##cast_valid_infinity                53.54ms    18.68
cast_varchar_as_double##try_cast_invalid_nan               53.60ms    18.66
cast_varchar_as_double##try_cast_invalid_infini            53.46ms    18.71
cast_varchar_as_double##try_cast_space                     27.01ms    37.02
```
I further checked why DuckDB is fast. DuckDB's cast from varchar to floats uses a faster technique based on: https://johnnylee-sde.github.io/Fast-numeric-string-to-int/
I will try to introduce this technique in Velox and see if it improves the time. It is supposed to bring the time down from O(N) to O(log N).
Spawning a new issue for the comment on https://github.com/prestodb/presto/issues/22184#issuecomment-1996569271
Experiment with and without CAST
I was looking at possible causes for the latency difference observed between the Native & Java clusters for TPC-DS Q23 (SF 10K), and observed the below w.r.t. the performance of the CAST operator.
On a Native cluster:
- Measure the read speed of the `integer` column `ss_quantity` read as-is
- Compare this against the read speed when we are forced to CAST the column to a decimal

On a Java cluster:
- Measure the read speed of the `integer` column `ss_quantity` read as-is
- Compare this against the read speed when we are forced to CAST the column to a decimal
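A minimal sketch of the two probe queries, one per case (illustrative only; the aggregate and table name are assumed, not taken from the original runs, and the aggregate is there just to force a full scan of the column):

```sql
-- Baseline: read the INTEGER column as-is
SELECT SUM(ss_quantity) FROM store_sales;

-- Forced cast: same scan, but widen each value to DECIMAL(10,0)
SELECT SUM(CAST(ss_quantity AS DECIMAL(10,0))) FROM store_sales;
```

Running the same pair on both clusters isolates the cost of the CAST from the cost of the scan itself.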
Possible cause(s)