aaneja opened this issue 7 months ago
Full performance test - https://gist.github.com/aaneja/dc70f655695933b6ff11978120bbebab
I tried the scan with and without the CAST on a Native cluster locally. The difference I noticed is that the PartitionedOutput operator's output size is almost 2x with the CAST vs. without it. In both cases this is the most expensive operator because of the writes over the network; it is just that the volume of data is larger when casting to decimal. The TableScan + FilterProject (which contains the CAST) takes about the same time as in the no-cast query.
| Native No CAST | Native With CAST |
|---|---|
| *(screenshot)* | *(screenshot)* |
Possible reason for the 2x output size: `ss_quantity` is an `INTEGER` (4 bytes) in Velox, while the decimal variants are short decimals (8 bytes) and long decimals (16 bytes). When `ss_quantity` is cast to `DECIMAL(10,0)` it becomes a short decimal, so the output size doubles.
@karteekmurthys: Let's do a microbenchmark comparing CAST(INT -> DECIMAL) performance in Velox against DuckDB and Presto Java to validate the issue and analyze further.
Here is a benchmark run compared against DuckDB:
```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp      relative  time/iter  iters/s
============================================================================
cast##cast_decimal_as_bigint                              969.12us    1.03K
cast##cast_int_as_short_decimal                           911.29us    1.10K
cast##cast_int_as_long_decimal                            925.29us    1.08K
cast##cast_bigint_as_short_decimal                        906.91us    1.10K
cast##cast_bigint_as_long_decimal                         922.87us    1.08K
```
Update: a positive relative time implies Velox is slower than DuckDB. Here is how the relative time is computed:

```cpp
trialResults[actualTrials] = std::make_pair(
    max(0.0, double(nsecs.count()) / timeIterData.niter - globalBaseline),
    std::move(timeIterData.userCounters));
```
I ran presto_server under the CLion profiler and found that the majority of the time is indeed spent in PartitionedOutput reads/writes. CRC computation is one of the major contributors to the slowdown in the PartitionedOutput operator.
I tried disabling the CRC computation on PrestoSerializedPage and the total time dropped from 25 min to 20.40 min. For the query without the CAST, however, disabling the CRC check made no difference; the total time is still ~16 min.
The issue is real and can easily be observed on the internal perf dashboard. The cost is mainly in the FilterProject operator when doing type casts like the ones discussed here. My guess is that the perf regression is two-fold. I can look into how to improve it.
Observed this with TPC-DS Q74 on a run comparing perf between Java and Prestissimo clusters. Java fragment 13, where we do a CAST(ss_net_paid AS double):
```
- Project[PlanNodeId 3858][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20), $hashvalue_928:bigint]
        Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (50.09GB), cpu: 355,545,124,759.71, memory: 678,625,159.92, network: 12,553,302,851.47}
        CPU: 1.50m (1.73%), Scheduled: 2.49m (1.06%), Output: 534,235,692 rows (30.13GB)
        Input avg.: 4,173,716.34 rows, Input std.dev.: 39.27%
        ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)
        $hashvalue_928 := combine_hash(combine_hash(combine_hash(BIGINT'0', COALESCE($operator$hash_code(c_customer_id), BIGINT'0')), COALESCE($operator$hash_code(c_first_name), BIGINT'0')), COALESCE($operator$hash_code(c_last_name), BIGINT'0')) (14:30)
```
This fragment uses 1.50m of CPU, vs. the same Project on Native:
```
- Project[PlanNodeId 3894][projectLocality = LOCAL] => [ss_net_paid_12:double, c_customer_id:varchar(16), c_last_name:varchar(30), c_first_name:varchar(20)]
        Estimates: {source: CostBasedSourceInfo, rows: 524,499,895 (45.70GB), cpu: 175,362,539,178.07, memory: 570,598,993.11, network: 7,724,777,633.00}
        CPU: 25.39m (31.75%), Scheduled: 4.17h (33.24%), Output: 534,235,692 rows (87.99GB)
        Input avg.: 4,173,716.34 rows, Input std.dev.: 387.30%
        ss_net_paid_12 := CAST(ss_net_paid AS double) (9:27)
```
This fragment uses 25.39m of CPU.
I ran the benchmark locally for the cast from varchar to double, comparing against DuckDB and measuring the relative time taken. We are several milliseconds slower than DuckDB here as well:
```
cast_varchar_as_double##cast_valid                        246.79ms     4.05
cast_varchar_as_double##cast_valid_nan                     45.05ms    22.20
cast_varchar_as_double##cast_valid_infinity                53.54ms    18.68
cast_varchar_as_double##try_cast_invalid_nan               53.60ms    18.66
cast_varchar_as_double##try_cast_invalid_infini            53.46ms    18.71
cast_varchar_as_double##try_cast_space                     27.01ms    37.02
```
I further checked why DuckDB is fast. DuckDB's cast from varchar to floats uses a faster technique based on: https://johnnylee-sde.github.io/Fast-numeric-string-to-int/
I will try to introduce this technique in Velox and see if it improves the time. It is supposed to bring the time down from O(N) to O(log N).
Spawning a new issue for the comment on https://github.com/prestodb/presto/issues/22184#issuecomment-1996569271
Experiment with and without CAST
I was looking at possible causes for the latency difference observed between the Native & Java clusters for TPC-DS Q23 (SF 10K), and observed the below w.r.t. the performance of the CAST operator.
On a Native cluster:
- Measure the read speed of the `integer` column `ss_quantity` read as-is
- Compare this against the read speed when we are forced to CAST the column to a decimal

On a Java cluster:
- Measure the read speed of the `integer` column `ss_quantity` read as-is
- Compare this against the read speed when we are forced to CAST the column to a decimal
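A minimal sketch of the two probe queries, one per case (illustrative only; the aggregate and table name are assumed, not taken from the original runs, and the aggregate is there just to force a full scan of the column):

```sql
-- Baseline: read the INTEGER column as-is
SELECT SUM(ss_quantity) FROM store_sales;

-- Forced cast: same scan, but widen each value to DECIMAL(10,0)
SELECT SUM(CAST(ss_quantity AS DECIMAL(10,0))) FROM store_sales;
```

Running the same pair on both clusters isolates the cost of the CAST from the cost of the scan itself.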
Possible cause(s)