rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Improve occupancy during hash table build #15502

Open tgujar opened 7 months ago

tgujar commented 7 months ago

Is your feature request related to a problem? Please describe. The cuco insert kernel has poor occupancy due to high register usage during the hash table build operation executed by cuDF. If I disable some of the code paths for complex types (commenting out dict, string, list, struct, decimal) in the type dispatcher (https://github.com/rapidsai/cudf/blob/434df44d9fe1c94e8047bcc37266ae663eae8a8d/cpp/include/cudf/utilities/type_dispatcher.hpp#L456), the register usage per thread drops from 75 to 46, which leads to a significant occupancy bump. It seems that the insert kernel pays the cost of high register usage even for simpler types, since the compiler has to account for all code paths.
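
To make the register/occupancy link concrete, here is a rough back-of-the-envelope sketch. It assumes A100-class per-SM limits (65536 registers, 64 resident warps, registers allocated per warp in units of 256) and ignores block size, shared memory, and warp allocation granularity, so it only gives an upper bound from registers alone. The names `register_limited_warps` and `register_limited_occupancy` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for an A100 (compute capability 8.0); adjust for other GPUs.
constexpr int kRegistersPerSM = 65536;  // 32-bit registers per SM
constexpr int kMaxWarpsPerSM  = 64;     // 2048 resident threads / 32
constexpr int kThreadsPerWarp = 32;
constexpr int kRegAllocUnit   = 256;    // assumed per-warp register allocation unit

// Upper bound on resident warps per SM imposed by register usage alone.
constexpr int register_limited_warps(int regs_per_thread)
{
  // Round the per-warp register footprint up to the allocation unit.
  int const regs_per_warp =
    ((regs_per_thread * kThreadsPerWarp + kRegAllocUnit - 1) / kRegAllocUnit) * kRegAllocUnit;
  return std::min(kMaxWarpsPerSM, kRegistersPerSM / regs_per_warp);
}

// Same bound expressed as a fraction of the maximum resident warps.
constexpr double register_limited_occupancy(int regs_per_thread)
{
  return static_cast<double>(register_limited_warps(regs_per_thread)) / kMaxWarpsPerSM;
}
```

Under these assumptions, 75 registers per thread allows at most 25 resident warps per SM (about 39% of 64), while 46 registers allows 42 warps (about 66%). The real achieved occupancy also depends on the launch configuration, so treat these numbers as an estimate only.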

I ran some experiments disabling different subsets of types. Each entry lists the types I disabled -> the register count for the insert kernel.

Here is the speedup I see on the mixed semi join kernel for int32 keys, obtained by disabling complex types to improve occupancy: [image]

Describe the solution you'd like Improve occupancy by disabling codepaths for complex types.

Describe alternatives you've considered

  1. Add more template params to the hasher/comparator which allow us to separate codepaths for complex types and simpler types, or
  2. Add JIT compilation to only consider the types necessary for hasher/comparator for a row


tgujar commented 7 months ago

Since option 1 doesn't incur the cost of JIT compilation, it may be the better approach in terms of performance. My current plan is to arrange the types in increasing order of register usage and split the large switch in type_dispatcher into reasonable chunks. We activate a chunk by branching on the CPU and dispatching the appropriate compile-time condition based on type. What do you think about this? This does tie the implementation of the comparator/hasher for a type to the granularity and split of the chunks, but maybe this is okay? Essentially, the compile-time conditional would be like has_nested_column but more granular. We can abstract the CPU-side branching that creates the compile-time condition into a function, which can be used wherever we need to construct a device row hasher/comparator.

if constexpr(cond) {
 switch(...) {
  ...
 }
}
// ... repeat for the remaining chunks

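
As a slightly fuller sketch of the chunked switch idea above, here is a toy host-only version. It is not the libcudf `type_dispatcher`: the `type_id` enum, `path_tag` functor, and `chunked_dispatch` function are hypothetical stand-ins, and only two chunks are shown. The point is that the second chunk is discarded at compile time when `EnableComplex` is false, so the compiler never sees those code paths for that instantiation.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified stand-in for cudf::type_id.
enum class type_id { INT32, INT64, FLOAT64, STRING, LIST };

// Hypothetical functor: returns the id as an int so the chosen path is observable.
struct path_tag {
  template <type_id Id>
  int operator()() const
  {
    return static_cast<int>(Id);
  }
};

// EnableComplex plays the role of the compile-time condition.
template <bool EnableComplex, typename Functor>
int chunked_dispatch(type_id id, Functor f)
{
  // Chunk 1: cheap fixed-width types, always compiled.
  switch (id) {
    case type_id::INT32: return f.template operator()<type_id::INT32>();
    case type_id::INT64: return f.template operator()<type_id::INT64>();
    case type_id::FLOAT64: return f.template operator()<type_id::FLOAT64>();
    default: break;
  }
  // Chunk 2: expensive types, only instantiated when requested.
  if constexpr (EnableComplex) {
    switch (id) {
      case type_id::STRING: return f.template operator()<type_id::STRING>();
      case type_id::LIST: return f.template operator()<type_id::LIST>();
      default: break;
    }
  }
  throw std::invalid_argument("type not enabled in this dispatcher instantiation");
}
```

The caller is then responsible for picking `chunked_dispatch<true>` or `chunked_dispatch<false>` on the CPU based on the column types actually present; a type outside the enabled chunks throws here, though a real implementation might prefer a different failure mode.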
davidwendt commented 7 months ago

This sounds like an interesting approach. But the mixed-semi-join must still work for all types. So I'm still not sure how this helps. The type_dispatcher is used universally across libcudf, and I would be reluctant to modify it like this for general usage. I would instead recommend building a new type_dispatcher_chunked that could be used to vet this idea.

tgujar commented 7 months ago

Although I tested this out only for mixed_semi_join, it should be applicable to all hash joins that use device_row_hasher and device_row_comparator. It allows us to compile different versions of the type_dispatcher so that we have lower register usage for rows that only use simpler types. We can then dispatch the appropriate version based on a CPU-side branch.

sleeepyjack commented 7 months ago

> But the mixed-semi-join must still work for all types.

Right. I guess the idea is that we internally dispatch the comparator/hasher type at runtime based on the type requirements and then pass the one with the least overhead to the kernel. This is a common pattern, I'd say, where each runtime branch leads to a separately compiled kernel. If we can afford the compilation time overhead in cudf, then this is the right way to go. The downside is that if we want this optimization to happen, we have to explicitly type out the `if constexpr (has_type_feature) { // dispatch kernel1 } else if (has_other_type_feature) { // dispatch kernel2 } else ...` logic whenever we use the comparator/hasher. It should still work "for all types", but offers the opportunity to launch a faster kernel if the actual combination of input types allows for it.
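
A minimal host-only sketch of that pattern, with everything hypothetical: `simple_comparator`, `full_comparator`, `launch_join`, and `has_complex_type` are illustrative names, and the "kernel launch" just returns a label. Each branch of the runtime check instantiates a different template, which in real code would mean a separately compiled kernel.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical subset of column types.
enum class type_id { INT32, INT64, STRING, LIST };

// Two hypothetical comparator flavors: one for fixed-width keys only, one for all types.
struct simple_comparator {
  static char const* name() { return "simple"; }
};
struct full_comparator {
  static char const* name() { return "full"; }
};

// Stand-in for a kernel launch; each Comparator type yields its own instantiation.
template <typename Comparator>
std::string launch_join()
{
  return Comparator::name();
}

// Runtime (CPU-side) inspection of the key column types.
bool has_complex_type(std::vector<type_id> const& key_types)
{
  for (auto const t : key_types) {
    if (t == type_id::STRING || t == type_id::LIST) { return true; }
  }
  return false;
}

// The runtime branch selects which compiled variant to launch.
std::string join(std::vector<type_id> const& key_types)
{
  if (has_complex_type(key_types)) { return launch_join<full_comparator>(); }
  return launch_join<simple_comparator>();
}
```

All inputs still work, since the full variant remains available, but keys made up only of simple types take the cheaper path. The cost is one extra instantiation per branch at every call site that wants the optimization.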

vyasr commented 6 months ago

Could this be done using a custom IdTypeMap to the type dispatcher that dispatches unsupported types to null? Perhaps we could define a helper factory to produce such a mapping easily?

davidwendt commented 6 months ago

I was wondering the same as Vyas: Something similar to the dispatch_void_if_nested map from here: https://github.com/rapidsai/cudf/blob/888e9d5c38cb27402313681744b87462846bc405/cpp/include/cudf/table/experimental/row_operators.cuh#L77

vyasr commented 6 months ago

Yes, exactly. I think we can do something like that, except that we consign even more of the types in the first branch of the conditional_t to void.
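
For readers unfamiliar with the idea, here is a toy sketch of an id-to-type map that sends complex types to void. It only mimics the shape of the approach and is not the libcudf API: `type_id`, `id_to_type`, `default_map`, `dispatch_void_if_complex`, `dispatch`, and `element_size` are all simplified, hypothetical stand-ins written for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical subset of column types.
enum class type_id { INT32, INT64, STRING, LIST };

struct list_placeholder {};  // hypothetical stand-in for a nested column type

// Default id -> C++ type mapping.
template <type_id Id> struct id_to_type_impl;
template <> struct id_to_type_impl<type_id::INT32> { using type = std::int32_t; };
template <> struct id_to_type_impl<type_id::INT64> { using type = std::int64_t; };
template <> struct id_to_type_impl<type_id::STRING> { using type = std::string; };
template <> struct id_to_type_impl<type_id::LIST> { using type = list_placeholder; };
template <type_id Id> using id_to_type = typename id_to_type_impl<Id>::type;

template <type_id Id>
struct default_map {
  using type = id_to_type<Id>;
};

// Custom map: complex types are consigned to void, everything else is unchanged.
template <type_id Id>
struct dispatch_void_if_complex {
  using type =
    std::conditional_t<Id == type_id::STRING || Id == type_id::LIST, void, id_to_type<Id>>;
};

// Toy dispatcher parameterized on the map, so the switch itself never changes.
template <template <type_id> class IdTypeMap, typename Functor>
int dispatch(type_id id, Functor f)
{
  switch (id) {
    case type_id::INT32: return f.template operator()<typename IdTypeMap<type_id::INT32>::type>();
    case type_id::INT64: return f.template operator()<typename IdTypeMap<type_id::INT64>::type>();
    case type_id::STRING: return f.template operator()<typename IdTypeMap<type_id::STRING>::type>();
    case type_id::LIST: return f.template operator()<typename IdTypeMap<type_id::LIST>::type>();
  }
  return -1;
}

// Functor whose void path is trivial, so the heavy code is never instantiated for it.
struct element_size {
  template <typename T>
  int operator()() const
  {
    if constexpr (std::is_void_v<T>) {
      return 0;  // unsupported in this variant
    } else {
      return static_cast<int>(sizeof(T));
    }
  }
};
```

With `dispatch<dispatch_void_if_complex>`, the string and list cases collapse onto the same trivial `void` instantiation, while `dispatch<default_map>` keeps the full set of code paths. The switch stays untouched; only the map passed in decides which types carry real work.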

tgujar commented 6 months ago

Ah okay, this would also achieve what we need, without adding complexity to the type_dispatcher switch. Makes sense!

tgujar commented 5 months ago

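Adding results for reference: cudf join benchmarks for all join types, showing the speedups from disabling complex types on an A100. Ref is the baseline and Cmp is the build with complex types disabled, so a negative %Diff means Cmp is faster. The Status column appears to flag FAIL whenever the difference exceeds the measurement noise, so FAIL by itself does not indicate a regression.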

# inner_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  98.086 us |      10.98% |  96.404 us |      17.15% |    -1.682 us |  -1.71% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 227.694 us |       0.91% | 240.492 us |       1.21% |    12.798 us |   5.62% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |  77.442 ms |       0.08% |  72.561 ms |       0.27% | -4881.351 us |  -6.30% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 129.711 us |       2.88% | 122.074 us |       1.65% |    -7.637 us |  -5.89% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   2.959 ms |       0.12% |   2.678 ms |       0.19% |  -281.654 us |  -9.52% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   6.161 ms |       0.10% |   4.792 ms |       0.09% | -1368.399 us | -22.21% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 104.352 us |       4.46% | 102.475 us |       4.44% |    -1.877 us |  -1.80% |   PASS   |
|  I32  |     1      |   100000    |     1000     | 141.904 us |       3.33% | 134.245 us |       3.39% |    -7.659 us |  -5.40% |   FAIL   |
|  I32  |     1      |  10000000   |     1000     |   9.487 ms |       0.07% |   7.559 ms |       0.09% | -1927.830 us | -20.32% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 127.620 us |       3.44% | 124.913 us |       3.30% |    -2.706 us |  -2.12% |   PASS   |
|  I32  |     1      |  10000000   |    100000    |   1.102 ms |       0.31% | 845.535 us |       0.57% |  -256.198 us | -23.25% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   3.242 ms |       0.15% |   2.233 ms |       0.22% | -1009.196 us | -31.13% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 101.068 us |       3.71% |  84.700 us |       3.94% |   -16.368 us | -16.19% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 241.602 us |       1.88% | 250.911 us |       1.34% |     9.309 us |   3.85% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |  77.688 ms |       0.09% |  73.055 ms |       0.16% | -4633.707 us |  -5.96% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 133.119 us |       3.02% | 123.304 us |       1.78% |    -9.815 us |  -7.37% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   3.045 ms |       0.15% |   2.771 ms |       0.17% |  -274.473 us |  -9.01% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   6.259 ms |       0.07% |   4.880 ms |       0.09% | -1379.167 us | -22.03% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 108.876 us |       5.20% | 104.452 us |       4.35% |    -4.423 us |  -4.06% |   PASS   |
|  I64  |     1      |   100000    |     1000     | 145.414 us |       2.87% | 135.280 us |       3.01% |   -10.134 us |  -6.97% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   9.611 ms |       0.06% |   7.674 ms |       0.09% | -1936.830 us | -20.15% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 129.239 us |       2.98% | 125.751 us |       3.41% |    -3.487 us |  -2.70% |   PASS   |
|  I64  |     1      |  10000000   |    100000    |   1.134 ms |       0.42% | 868.301 us |       0.60% |  -266.107 us | -23.46% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   3.312 ms |       0.17% |   2.301 ms |       0.27% | -1011.432 us | -30.54% |   FAIL   |

# left_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  92.324 us |       3.73% |  83.777 us |       3.14% |    -8.547 us |  -9.26% |   FAIL   |
|  I32  |     0      |   100000    |     1000     | 228.989 us |       1.26% | 240.727 us |       0.83% |    11.737 us |   5.13% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |  77.638 ms |       0.10% |  72.904 ms |       0.18% | -4733.820 us |  -6.10% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 131.165 us |       3.04% | 122.578 us |       1.66% |    -8.586 us |  -6.55% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   2.963 ms |       0.17% |   2.676 ms |       0.20% |  -286.724 us |  -9.68% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   6.331 ms |       0.06% |   4.906 ms |       0.13% | -1424.338 us | -22.50% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 102.337 us |       4.29% | 103.055 us |       4.45% |     0.718 us |   0.70% |   PASS   |
|  I32  |     1      |   100000    |     1000     | 144.638 us |       2.41% | 134.510 us |       3.18% |   -10.127 us |  -7.00% |   FAIL   |
|  I32  |     1      |  10000000   |     1000     |   9.513 ms |       0.07% |   7.590 ms |       0.11% | -1922.358 us | -20.21% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 132.344 us |       3.64% | 125.750 us |       3.37% |    -6.594 us |  -4.98% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   1.103 ms |       0.42% | 848.855 us |       0.49% |  -254.131 us | -23.04% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   3.422 ms |       0.12% |   2.398 ms |       0.23% | -1023.998 us | -29.92% |   FAIL   |
|  I64  |     0      |    1000     |     1000     |  97.573 us |      12.68% |  89.277 us |      13.56% |    -8.296 us |  -8.50% |   PASS   |
|  I64  |     0      |   100000    |     1000     | 246.080 us |       0.88% | 253.108 us |       1.86% |     7.028 us |   2.86% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |  77.868 ms |       0.02% |  73.247 ms |       0.18% | -4621.161 us |  -5.93% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 137.802 us |       1.51% | 123.913 us |       1.69% |   -13.889 us | -10.08% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   3.051 ms |       0.14% |   2.772 ms |       0.21% |  -279.087 us |  -9.15% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   6.432 ms |       0.06% |   4.996 ms |       0.09% | -1436.381 us | -22.33% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 112.130 us |       5.81% | 104.755 us |       4.48% |    -7.375 us |  -6.58% |   FAIL   |
|  I64  |     1      |   100000    |     1000     | 148.167 us |       3.29% | 136.024 us |       3.25% |   -12.144 us |  -8.20% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   9.667 ms |       0.08% |   7.678 ms |       0.09% | -1989.216 us | -20.58% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 137.450 us |       2.99% | 128.075 us |       2.86% |    -9.375 us |  -6.82% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    |   1.139 ms |       0.33% | 871.557 us |       0.48% |  -267.216 us | -23.47% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   3.493 ms |       0.13% |   2.442 ms |       0.17% | -1050.825 us | -30.08% |   FAIL   |

# full_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 166.829 us |       2.76% | 159.519 us |       2.33% |    -7.310 us |  -4.38% |   FAIL   |
|  I32  |     0      |   100000    |     1000     | 309.185 us |       1.64% | 322.705 us |       1.15% |    13.520 us |   4.37% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |  77.990 ms |       0.02% |  73.489 ms |       0.13% | -4500.572 us |  -5.77% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 216.151 us |       1.80% | 204.264 us |       1.79% |   -11.887 us |  -5.50% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   3.507 ms |       0.20% |   3.213 ms |       0.18% |  -294.624 us |  -8.40% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   6.991 ms |       0.07% |   5.569 ms |       0.09% | -1422.066 us | -20.34% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 182.399 us |       2.86% | 181.162 us |       3.14% |    -1.237 us |  -0.68% |   PASS   |
|  I32  |     1      |   100000    |     1000     | 223.190 us |       2.49% | 215.676 us |       2.38% |    -7.514 us |  -3.37% |   FAIL   |
|  I32  |     1      |  10000000   |     1000     |   9.865 ms |       0.07% |   7.920 ms |       0.10% | -1944.961 us | -19.72% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 215.615 us |       2.85% | 205.486 us |       2.15% |   -10.130 us |  -4.70% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   1.435 ms |       0.40% |   1.181 ms |       0.46% |  -253.949 us | -17.69% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   3.923 ms |       0.14% |   2.899 ms |       0.19% | -1023.511 us | -26.09% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 174.889 us |       2.72% | 166.039 us |       2.74% |    -8.849 us |  -5.06% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 325.375 us |       0.97% | 333.593 us |       1.54% |     8.218 us |   2.53% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |  78.326 ms |       0.02% |  73.823 ms |       0.20% | -4503.365 us |  -5.75% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 218.765 us |       1.41% | 204.268 us |       1.79% |   -14.497 us |  -6.63% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   3.594 ms |       0.16% |   3.303 ms |       0.18% |  -291.079 us |  -8.10% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   7.096 ms |       0.08% |   5.659 ms |       0.08% | -1437.104 us | -20.25% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 192.823 us |       4.05% | 185.063 us |       2.98% |    -7.760 us |  -4.02% |   FAIL   |
|  I64  |     1      |   100000    |     1000     | 232.694 us |       2.00% | 216.077 us |       2.11% |   -16.617 us |  -7.14% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   9.981 ms |       0.08% |   8.019 ms |       0.10% | -1962.527 us | -19.66% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 221.943 us |       2.64% | 207.881 us |       2.72% |   -14.062 us |  -6.34% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    |   1.472 ms |       0.32% |   1.205 ms |       0.46% |  -266.914 us | -18.13% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   3.990 ms |       0.13% |   2.948 ms |       0.39% | -1041.570 us | -26.11% |   FAIL   |

# mixed_inner_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 169.332 us |       6.73% | 160.332 us |       2.42% |    -9.000 us |  -5.31% |   FAIL   |
|  I32  |     0      |   100000    |     1000     | 195.335 us |       2.98% | 176.477 us |       1.84% |   -18.858 us |  -9.65% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |   4.233 ms |       0.50% |   2.507 ms |       0.59% | -1725.877 us | -40.77% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 222.984 us |       2.38% | 199.399 us |       2.61% |   -23.585 us | -10.58% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   5.237 ms |       0.11% |   3.165 ms |       0.16% | -2072.743 us | -39.58% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   9.186 ms |       0.06% |   7.028 ms |       0.08% | -2157.709 us | -23.49% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 163.226 us |       3.29% | 160.243 us |       2.92% |    -2.982 us |  -1.83% |   PASS   |
|  I32  |     1      |   100000    |     1000     | 190.103 us |       3.62% | 184.166 us |       3.67% |    -5.936 us |  -3.12% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   2.765 ms |       0.15% |   2.170 ms |       0.20% |  -594.808 us | -21.51% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 200.366 us |       2.95% | 191.996 us |       2.90% |    -8.370 us |  -4.18% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   3.128 ms |       0.18% |   2.421 ms |       0.23% |  -706.507 us | -22.59% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   4.303 ms |       0.15% |   3.344 ms |       0.14% |  -958.981 us | -22.29% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 176.402 us |       1.70% | 164.130 us |       1.93% |   -12.272 us |  -6.96% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 215.997 us |       1.67% | 186.102 us |       2.66% |   -29.895 us | -13.84% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   4.736 ms |       0.33% |   2.824 ms |       0.37% | -1911.393 us | -40.36% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 229.272 us |       1.15% | 202.858 us |       1.16% |   -26.414 us | -11.52% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   5.513 ms |       0.21% |   3.347 ms |       0.21% | -2165.892 us | -39.29% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   9.374 ms |       0.04% |   7.189 ms |       0.11% | -2185.703 us | -23.32% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 167.720 us |       4.29% | 169.695 us |       5.16% |     1.975 us |   1.18% |   PASS   |
|  I64  |     1      |   100000    |     1000     | 200.827 us |       2.62% | 185.654 us |       2.14% |   -15.173 us |  -7.56% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   2.830 ms |       0.21% |   2.221 ms |       0.18% |  -609.265 us | -21.53% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 207.317 us |       2.38% | 195.208 us |       2.67% |   -12.109 us |  -5.84% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    |   3.258 ms |       0.15% |   2.517 ms |       0.21% |  -741.170 us | -22.75% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   4.402 ms |       0.12% |   3.405 ms |       0.24% |  -996.264 us | -22.63% |   FAIL   |

# mixed_left_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 162.393 us |       2.19% | 163.145 us |       3.06% |     0.752 us |   0.46% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 198.993 us |       1.80% | 183.776 us |       2.78% |   -15.217 us |  -7.65% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |   4.390 ms |       0.50% |   2.821 ms |       0.66% | -1568.944 us | -35.74% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 225.268 us |       1.53% | 208.287 us |       2.38% |   -16.981 us |  -7.54% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   5.403 ms |       0.11% |   3.467 ms |       0.16% | -1935.176 us | -35.82% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   9.380 ms |       0.08% |   7.531 ms |       0.08% | -1848.927 us | -19.71% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 161.592 us |       2.42% | 166.762 us |       3.71% |     5.169 us |   3.20% |   FAIL   |
|  I32  |     1      |   100000    |     1000     | 190.968 us |       2.38% | 189.389 us |       2.76% |    -1.579 us |  -0.83% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   2.851 ms |       0.17% |   2.459 ms |       0.16% |  -391.901 us | -13.74% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 202.721 us |       3.10% | 195.817 us |       2.39% |    -6.904 us |  -3.41% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   3.210 ms |       0.17% |   2.717 ms |       0.16% |  -492.765 us | -15.35% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   4.385 ms |       0.12% |   3.709 ms |       0.14% |  -675.739 us | -15.41% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 176.685 us |       1.68% | 175.911 us |       1.87% |    -0.774 us |  -0.44% |   PASS   |
|  I64  |     0      |   100000    |     1000     | 216.935 us |       1.92% | 196.961 us |       2.35% |   -19.975 us |  -9.21% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   4.828 ms |       0.39% |   3.257 ms |       0.34% | -1571.047 us | -32.54% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 229.205 us |       1.58% | 215.238 us |       1.39% |   -13.967 us |  -6.09% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   5.622 ms |       0.16% |   3.713 ms |       0.13% | -1909.231 us | -33.96% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   9.529 ms |       0.05% |   7.710 ms |       0.10% | -1818.191 us | -19.08% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 169.657 us |       3.64% | 169.704 us |       5.40% |     0.047 us |   0.03% |   PASS   |
|  I64  |     1      |   100000    |     1000     | 201.867 us |       2.30% | 189.909 us |       2.16% |   -11.958 us |  -5.92% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   2.916 ms |       0.20% |   2.504 ms |       0.16% |  -411.743 us | -14.12% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 208.764 us |       2.42% | 197.789 us |       2.54% |   -10.975 us |  -5.26% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    |   3.349 ms |       0.17% |   2.824 ms |       0.20% |  -525.004 us | -15.68% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   4.487 ms |       0.13% |   3.781 ms |       0.11% |  -706.651 us | -15.75% |   FAIL   |

# mixed_full_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 245.243 us |       2.57% | 245.005 us |       2.24% |    -0.238 us |  -0.10% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 252.424 us |       1.85% | 236.983 us |       2.68% |   -15.441 us |  -6.12% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |   4.801 ms |       0.48% |   3.250 ms |       0.61% | -1550.839 us | -32.30% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 314.877 us |       1.67% | 299.013 us |       2.45% |   -15.865 us |  -5.04% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   5.548 ms |       0.12% |   3.609 ms |       0.17% | -1938.081 us | -34.94% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |  10.065 ms |       0.06% |   8.218 ms |       0.09% | -1847.514 us | -18.36% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 245.065 us |       2.96% | 253.720 us |       2.95% |     8.656 us |   3.53% |   FAIL   |
|  I32  |     1      |   100000    |     1000     | 274.089 us |       1.57% | 276.936 us |       1.91% |     2.848 us |   1.04% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   3.123 ms |       0.19% |   2.735 ms |       0.22% |  -388.611 us | -12.44% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 288.628 us |       2.01% | 288.971 us |       1.95% |     0.343 us |   0.12% |   PASS   |
|  I32  |     1      |  10000000   |    100000    |   3.466 ms |       0.18% |   2.979 ms |       0.19% |  -486.333 us | -14.03% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   4.893 ms |       0.11% |   4.223 ms |       0.14% |  -669.865 us | -13.69% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 259.162 us |       2.31% | 258.211 us |       1.57% |    -0.951 us |  -0.37% |   PASS   |
|  I64  |     0      |   100000    |     1000     | 270.075 us |       1.47% | 250.506 us |       2.16% |   -19.569 us |  -7.25% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   5.330 ms |       0.36% |   3.696 ms |       0.31% | -1634.293 us | -30.66% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 318.434 us |       1.82% | 305.237 us |       1.81% |   -13.197 us |  -4.14% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   5.764 ms |       0.10% |   3.857 ms |       0.18% | -1906.983 us | -33.08% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |  10.215 ms |       0.06% |   8.399 ms |       0.07% | -1815.278 us | -17.77% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 250.737 us |       2.76% | 252.327 us |       4.18% |     1.590 us |   0.63% |   PASS   |
|  I64  |     1      |   100000    |     1000     | 285.622 us |       2.68% | 276.006 us |       2.14% |    -9.615 us |  -3.37% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   3.187 ms |       0.15% |   2.771 ms |       0.17% |  -415.906 us | -13.05% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 294.446 us |       1.82% | 290.163 us |       1.78% |    -4.283 us |  -1.45% |   PASS   |
|  I64  |     1      |  10000000   |    100000    |   3.604 ms |       0.17% |   3.081 ms |       0.23% |  -522.999 us | -14.51% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   4.995 ms |       0.12% |   4.294 ms |       0.13% |  -701.074 us | -14.04% |   FAIL   |

# mixed_left_semi_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 145.613 us |       2.26% | 144.217 us |       2.23% |    -1.395 us |  -0.96% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 170.829 us |       1.41% | 152.757 us |       1.77% |   -18.072 us | -10.58% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |   1.895 ms |       0.19% |   1.006 ms |       0.60% |  -889.331 us | -46.93% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 202.062 us |       1.80% | 173.403 us |       1.76% |   -28.659 us | -14.18% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   2.211 ms |       0.22% | 924.633 us |       0.86% | -1286.137 us | -58.18% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   6.987 ms |       0.06% |   4.590 ms |       0.33% | -2396.962 us | -34.31% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 167.868 us |       2.66% | 162.233 us |       3.45% |    -5.635 us |  -3.36% |   FAIL   |
|  I32  |     1      |   100000    |     1000     | 185.694 us |       2.22% | 185.361 us |       3.01% |    -0.333 us |  -0.18% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   1.458 ms |       0.27% |   2.204 ms |       0.29% |   746.145 us |  51.17% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 210.904 us |       2.24% | 204.400 us |       2.79% |    -6.505 us |  -3.08% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   1.512 ms |       0.34% |   2.211 ms |       0.26% |   699.026 us |  46.25% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   4.614 ms |       0.11% |   4.338 ms |       0.12% |  -275.850 us |  -5.98% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 149.793 us |       1.96% | 145.067 us |       1.58% |    -4.727 us |  -3.16% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 174.027 us |       2.17% | 153.994 us |       1.76% |   -20.033 us | -11.51% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   2.043 ms |       0.19% |   1.107 ms |       0.51% |  -936.674 us | -45.84% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 211.646 us |       1.56% | 175.540 us |       1.63% |   -36.106 us | -17.06% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   2.369 ms |       0.17% | 969.291 us |       0.70% | -1399.863 us | -59.09% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   7.239 ms |       0.07% |   4.881 ms |       0.26% | -2357.918 us | -32.57% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 168.605 us |       2.97% | 161.085 us |       2.74% |    -7.520 us |  -4.46% |   FAIL   |
|  I64  |     1      |   100000    |     1000     | 183.886 us |       2.24% | 185.553 us |       2.77% |     1.667 us |   0.91% |   PASS   |
|  I64  |     1      |  10000000   |     1000     |   1.408 ms |       0.33% |   2.100 ms |       0.23% |   692.571 us |  49.20% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 207.247 us |       2.44% | 205.700 us |       2.49% |    -1.547 us |  -0.75% |   PASS   |
|  I64  |     1      |  10000000   |    100000    |   1.550 ms |       0.28% |   2.252 ms |       0.25% |   702.023 us |  45.28% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   4.710 ms |       0.09% |   4.423 ms |       0.13% |  -286.960 us |  -6.09% |   FAIL   |

# mixed_left_anti_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|--------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 145.609 us |       2.11% | 144.382 us |       2.63% |    -1.227 us |  -0.84% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 171.825 us |       2.52% | 153.282 us |       2.10% |   -18.543 us | -10.79% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     |   1.904 ms |       0.25% |   1.014 ms |       0.61% |  -889.953 us | -46.74% |   FAIL   |
|  I32  |     0      |   100000    |    100000    | 202.480 us |       1.94% | 173.366 us |       1.72% |   -29.114 us | -14.38% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   2.217 ms |       0.19% | 937.196 us |       0.64% | -1279.905 us | -57.73% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   6.993 ms |       0.05% |   4.595 ms |       0.24% | -2398.078 us | -34.29% |   FAIL   |
|  I32  |     1      |    1000     |     1000     | 168.087 us |       2.87% | 162.103 us |       3.43% |    -5.983 us |  -3.56% |   FAIL   |
|  I32  |     1      |   100000    |     1000     | 185.806 us |       2.24% | 185.825 us |       3.09% |     0.019 us |   0.01% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   1.468 ms |       0.28% |   2.214 ms |       0.29% |   746.122 us |  50.83% |   FAIL   |
|  I32  |     1      |   100000    |    100000    | 210.234 us |       2.17% | 204.434 us |       2.26% |    -5.801 us |  -2.76% |   FAIL   |
|  I32  |     1      |  10000000   |    100000    |   1.520 ms |       0.34% |   2.221 ms |       0.27% |   700.324 us |  46.07% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   4.623 ms |       0.13% |   4.346 ms |       0.11% |  -276.375 us |  -5.98% |   FAIL   |
|  I64  |     0      |    1000     |     1000     | 150.301 us |       2.38% | 145.754 us |       2.26% |    -4.548 us |  -3.03% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 175.850 us |       2.65% | 155.522 us |       2.32% |   -20.328 us | -11.56% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     |   2.053 ms |       0.22% |   1.116 ms |       0.60% |  -937.011 us | -45.65% |   FAIL   |
|  I64  |     0      |   100000    |    100000    | 211.996 us |       1.54% | 176.533 us |       1.88% |   -35.463 us | -16.73% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   2.376 ms |       0.12% | 979.465 us |       0.90% | -1396.656 us | -58.78% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   7.247 ms |       0.05% |   4.889 ms |       0.32% | -2357.155 us | -32.53% |   FAIL   |
|  I64  |     1      |    1000     |     1000     | 165.998 us |       2.76% | 161.697 us |       2.75% |    -4.301 us |  -2.59% |   PASS   |
|  I64  |     1      |   100000    |     1000     | 185.875 us |       2.65% | 186.063 us |       2.70% |     0.188 us |   0.10% |   PASS   |
|  I64  |     1      |  10000000   |     1000     |   1.416 ms |       0.29% |   2.110 ms |       0.22% |   693.833 us |  48.98% |   FAIL   |
|  I64  |     1      |   100000    |    100000    | 210.868 us |       2.16% | 205.904 us |       2.40% |    -4.964 us |  -2.35% |   FAIL   |
|  I64  |     1      |  10000000   |    100000    |   1.557 ms |       0.30% |   2.262 ms |       0.26% |   704.716 us |  45.27% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   4.718 ms |       0.09% |   4.432 ms |       0.14% |  -286.488 us |  -6.07% |   FAIL   |

# distinct_inner_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  65.532 us |       3.09% |  64.982 us |       2.64% |     -0.550 us |  -0.84% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 177.077 us |       2.45% | 173.223 us |       1.58% |     -3.854 us |  -2.18% |   FAIL   |
|  I32  |     0      |  10000000   |     1000     | 127.847 ms |       0.23% | 112.891 ms |       0.29% | -14955.437 us | -11.70% |   FAIL   |
|  I32  |     0      |   100000    |    100000    |  74.174 us |       2.04% |  73.944 us |       2.18% |     -0.230 us |  -0.31% |   PASS   |
|  I32  |     0      |  10000000   |    100000    |   2.902 ms |       0.15% |   2.555 ms |       0.18% |   -346.279 us | -11.93% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   3.399 ms |       0.11% |   3.175 ms |       0.11% |   -223.497 us |  -6.58% |   FAIL   |
|  I32  |     1      |    1000     |     1000     |  79.814 us |       5.41% |  76.913 us |       7.52% |     -2.901 us |  -3.63% |   PASS   |
|  I32  |     1      |   100000    |     1000     |  92.840 us |       3.88% |  92.128 us |       3.51% |     -0.712 us |  -0.77% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   8.466 ms |       0.12% |   7.823 ms |       0.20% |   -643.320 us |  -7.60% |   FAIL   |
|  I32  |     1      |   100000    |    100000    |  83.879 us |       3.92% |  83.950 us |       3.18% |      0.071 us |   0.08% |   PASS   |
|  I32  |     1      |  10000000   |    100000    | 720.880 us |       0.46% | 644.439 us |       0.53% |    -76.441 us | -10.60% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   1.251 ms |       0.33% |   1.123 ms |       0.35% |   -128.557 us | -10.27% |   FAIL   |
|  I64  |     0      |    1000     |     1000     |  59.217 us |       6.45% |  56.825 us |       3.13% |     -2.392 us |  -4.04% |   FAIL   |
|  I64  |     0      |   100000    |     1000     | 160.741 us |       2.43% | 156.386 us |       1.53% |     -4.354 us |  -2.71% |   FAIL   |
|  I64  |     0      |  10000000   |     1000     | 121.953 ms |       0.16% | 108.089 ms |       0.31% | -13864.132 us | -11.37% |   FAIL   |
|  I64  |     0      |   100000    |    100000    |  74.808 us |       3.29% |  75.774 us |       2.21% |      0.966 us |   1.29% |   PASS   |
|  I64  |     0      |  10000000   |    100000    |   2.905 ms |       0.17% |   2.559 ms |       0.21% |   -346.695 us | -11.93% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   3.441 ms |       0.11% |   3.222 ms |       0.13% |   -218.840 us |  -6.36% |   FAIL   |
|  I64  |     1      |    1000     |     1000     |  78.261 us |       5.32% |  80.001 us |       5.87% |      1.740 us |   2.22% |   PASS   |
|  I64  |     1      |   100000    |     1000     |  96.872 us |       4.55% |  92.556 us |       2.97% |     -4.315 us |  -4.45% |   FAIL   |
|  I64  |     1      |  10000000   |     1000     |   8.377 ms |       0.16% |   7.730 ms |       0.24% |   -646.769 us |  -7.72% |   FAIL   |
|  I64  |     1      |   100000    |    100000    |  84.971 us |       5.01% |  84.142 us |       3.24% |     -0.829 us |  -0.98% |   PASS   |
|  I64  |     1      |  10000000   |    100000    | 735.653 us |       0.49% | 655.612 us |       0.48% |    -80.041 us | -10.88% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   1.275 ms |       0.30% |   1.150 ms |       0.33% |   -124.760 us |  -9.79% |   FAIL   |

# distinct_left_join

## [0] NVIDIA A100-PCIE-40GB

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  45.936 us |       2.89% |  45.162 us |       3.16% |     -0.774 us |  -1.69% |   PASS   |
|  I32  |     0      |   100000    |     1000     | 158.619 us |       1.63% | 157.917 us |       1.45% |     -0.702 us |  -0.44% |   PASS   |
|  I32  |     0      |  10000000   |     1000     | 128.105 ms |       0.12% | 112.870 ms |       0.19% | -15234.979 us | -11.89% |   FAIL   |
|  I32  |     0      |   100000    |    100000    |  54.520 us |       2.29% |  52.926 us |       2.26% |     -1.594 us |  -2.92% |   FAIL   |
|  I32  |     0      |  10000000   |    100000    |   2.879 ms |       0.16% |   2.536 ms |       0.15% |   -343.242 us | -11.92% |   FAIL   |
|  I32  |     0      |  10000000   |   10000000   |   3.103 ms |       0.10% |   2.999 ms |       0.10% |   -104.180 us |  -3.36% |   FAIL   |
|  I32  |     1      |    1000     |     1000     |  62.031 us |       6.20% |  60.302 us |       6.83% |     -1.730 us |  -2.79% |   PASS   |
|  I32  |     1      |   100000    |     1000     |  77.090 us |       3.43% |  75.199 us |       4.36% |     -1.892 us |  -2.45% |   PASS   |
|  I32  |     1      |  10000000   |     1000     |   8.448 ms |       0.11% |   7.804 ms |       0.22% |   -644.545 us |  -7.63% |   FAIL   |
|  I32  |     1      |   100000    |    100000    |  65.435 us |       5.45% |  65.174 us |       4.00% |     -0.260 us |  -0.40% |   PASS   |
|  I32  |     1      |  10000000   |    100000    | 697.154 us |       0.36% | 625.882 us |       0.37% |    -71.272 us | -10.22% |   FAIL   |
|  I32  |     1      |  10000000   |   10000000   |   1.043 ms |       0.28% | 925.832 us |       0.27% |   -117.154 us | -11.23% |   FAIL   |
|  I64  |     0      |    1000     |     1000     |  41.621 us |       3.07% |  40.864 us |       2.73% |     -0.758 us |  -1.82% |   PASS   |
|  I64  |     0      |   100000    |     1000     | 141.435 us |       1.71% | 139.656 us |       1.47% |     -1.779 us |  -1.26% |   PASS   |
|  I64  |     0      |  10000000   |     1000     | 121.780 ms |       0.10% | 108.245 ms |       0.30% | -13535.139 us | -11.11% |   FAIL   |
|  I64  |     0      |   100000    |    100000    |  57.436 us |       2.32% |  54.937 us |       2.13% |     -2.499 us |  -4.35% |   FAIL   |
|  I64  |     0      |  10000000   |    100000    |   2.884 ms |       0.12% |   2.541 ms |       0.17% |   -343.763 us | -11.92% |   FAIL   |
|  I64  |     0      |  10000000   |   10000000   |   3.134 ms |       0.08% |   3.037 ms |       0.10% |    -97.289 us |  -3.10% |   FAIL   |
|  I64  |     1      |    1000     |     1000     |  61.852 us |       6.67% |  59.727 us |       6.78% |     -2.125 us |  -3.44% |   PASS   |
|  I64  |     1      |   100000    |     1000     |  78.340 us |       3.48% |  75.910 us |       3.75% |     -2.429 us |  -3.10% |   PASS   |
|  I64  |     1      |  10000000   |     1000     |   8.355 ms |       0.15% |   7.716 ms |       0.24% |   -639.140 us |  -7.65% |   FAIL   |
|  I64  |     1      |   100000    |    100000    |  67.700 us |       4.27% |  66.098 us |       4.21% |     -1.602 us |  -2.37% |   PASS   |
|  I64  |     1      |  10000000   |    100000    | 714.841 us |       0.41% | 637.604 us |       0.35% |    -77.238 us | -10.80% |   FAIL   |
|  I64  |     1      |  10000000   |   10000000   |   1.066 ms |       0.21% | 947.636 us |       0.33% |   -118.014 us | -11.07% |   FAIL   |
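
A note on reading these tables: `Diff` is `Cmp Time - Ref Time`, and `%Diff` is that difference relative to `Ref Time`, so a negative value means the compared build was faster. The `Status` column looks like a noise check rather than a regression check. In every row above, `FAIL` coincides with `|%Diff|` exceeding the smaller of the two noise figures, so most `FAIL` rows here are speedups. That rule is inferred from the rows themselves, not taken from the tool's documentation. The sketch below re-derives the three columns for a single row under that assumption; the helper names (`to_us`, `pct`, `recheck`) are made up for illustration.

```python
def to_us(text):
    """Convert a time string such as '1.066 ms' or '947.636 us' to microseconds."""
    value, unit = text.split()
    scale = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}[unit]
    return float(value) * scale


def pct(text):
    """Convert a percentage string such as '0.12%' to a float (0.12)."""
    return float(text.strip().rstrip("%"))


def recheck(ref_time, ref_noise, cmp_time, cmp_noise):
    """Recompute Diff (us), %Diff and Status for one table row.

    Assumption: Status is PASS when |%Diff| is within the smaller of the
    two noise percentages, and FAIL otherwise.
    """
    ref_us = to_us(ref_time)
    cmp_us = to_us(cmp_time)
    diff_us = cmp_us - ref_us
    pct_diff = 100.0 * diff_us / ref_us
    threshold = min(pct(ref_noise), pct(cmp_noise))
    status = "PASS" if abs(pct_diff) <= threshold else "FAIL"
    return diff_us, pct_diff, status


# distinct_left_join, I32, non-nullable, 10000000 x 1000 (marked FAIL, but faster)
diff_us, pct_diff, status = recheck("128.105 ms", "0.12%", "112.870 ms", "0.19%")
print(f"{diff_us:.3f} us  {pct_diff:.2f}%  {status}")

# mixed_left_anti_join, I64, nullable, 1000 x 1000 (marked PASS)
diff_us, pct_diff, status = recheck("165.998 us", "2.76%", "161.697 us", "2.75%")
print(f"{diff_us:.3f} us  {pct_diff:.2f}%  {status}")
```

Because the times in the table are already rounded, recomputed `Diff` values can differ from the printed ones in the last digits, most visibly in rows that mix `ms` and `us`. Under this reading, the rows that are actual slowdowns are the nullable `10000000 x 1000` and `10000000 x 100000` cases in the `mixed_left_anti_join` table, where `%Diff` is positive and well outside the noise.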