I got very suspicious when working on performance tests for fast arctan. My test results are highly variable, and they seem to jump in discrete increments:
JIT compiling fast_atan2_4 for x86-64-linux-tune_znver1-avx-avx2-f16c-fma-jit-sse41-user_context-vk_v13-vulkan
atan: 0.176901 ns per atan
fast_atan (MAE 1e-02): 0.173092 ns per atan ( 2.2% faster) [per invokation: 11.616016 ms]
fast_atan (MAE 1e-03): 0.172300 ns per atan ( 2.6% faster) [per invokation: 11.562847 ms]
fast_atan (MAE 1e-04): 0.172323 ns per atan ( 2.6% faster) [per invokation: 11.564432 ms]
fast_atan (MAE 1e-05): 0.172705 ns per atan ( 2.4% faster) [per invokation: 11.590015 ms]
fast_atan (MAE 1e-06): 0.173716 ns per atan ( 1.8% faster) [per invokation: 11.657883 ms]
atan2: 0.182086 ns per atan2
fast_atan2 (MAE 1e-02): 0.174972 ns per atan2 ( 3.9% faster) [per invokation: 11.742171 ms]
fast_atan2 (MAE 1e-03): 0.174859 ns per atan2 ( 4.0% faster) [per invokation: 11.734573 ms]
fast_atan2 (MAE 1e-04): 0.175999 ns per atan2 ( 3.3% faster) [per invokation: 11.811114 ms]
fast_atan2 (MAE 1e-05): 0.176096 ns per atan2 ( 3.3% faster) [per invokation: 11.817596 ms]
fast_atan2 (MAE 1e-06): 0.176075 ns per atan2 ( 3.3% faster) [per invokation: 11.816217 ms]
Another run:
atan: 0.176924 ns per atan
fast_atan (MAE 1e-02): 0.172724 ns per atan ( 2.4% faster) [per invokation: 11.591305 ms]
fast_atan (MAE 1e-03): 0.173269 ns per atan ( 2.1% faster) [per invokation: 11.627858 ms]
fast_atan (MAE 1e-04): 0.174131 ns per atan ( 1.6% faster) [per invokation: 11.685726 ms]
fast_atan (MAE 1e-05): 0.173564 ns per atan ( 1.9% faster) [per invokation: 11.647658 ms]
fast_atan (MAE 1e-06): 0.346123 ns per atan (-95.6% faster) [per invokation: 23.227917 ms]
atan2: 0.182132 ns per atan2
fast_atan2 (MAE 1e-02): 0.175971 ns per atan2 ( 3.4% faster) [per invokation: 11.809239 ms]
fast_atan2 (MAE 1e-03): 0.175526 ns per atan2 ( 3.6% faster) [per invokation: 11.779378 ms]
fast_atan2 (MAE 1e-04): 0.176735 ns per atan2 ( 3.0% faster) [per invokation: 11.860472 ms]
fast_atan2 (MAE 1e-05): 0.177133 ns per atan2 ( 2.7% faster) [per invokation: 11.887211 ms]
fast_atan2 (MAE 1e-06): 0.360196 ns per atan2 (-97.8% faster) [per invokation: 24.172320 ms]
They all hover around this 11.7 ms time, and sometimes, when the test doesn't hit the right performance, it lands at almost exactly double that: 24 ms. Compare that to CUDA:
atan: 0.014434 ns per atan
fast_atan (MAE 1e-02): 0.007271 ns per atan (49.6% faster) [per invokation: 0.487923 ms]
fast_atan (MAE 1e-03): 0.007490 ns per atan (48.1% faster) [per invokation: 0.502641 ms]
fast_atan (MAE 1e-04): 0.007792 ns per atan (46.0% faster) [per invokation: 0.522928 ms]
fast_atan (MAE 1e-05): 0.008710 ns per atan (39.7% faster) [per invokation: 0.584539 ms]
fast_atan (MAE 1e-06): 0.009016 ns per atan (37.5% faster) [per invokation: 0.605042 ms]
atan2: 0.014800 ns per atan2
fast_atan2 (MAE 1e-02): 0.009493 ns per atan2 (35.9% faster) [per invokation: 0.637034 ms]
fast_atan2 (MAE 1e-03): 0.009774 ns per atan2 (34.0% faster) [per invokation: 0.655949 ms]
fast_atan2 (MAE 1e-04): 0.010010 ns per atan2 (32.4% faster) [per invokation: 0.671784 ms]
fast_atan2 (MAE 1e-05): 0.010671 ns per atan2 (27.9% faster) [per invokation: 0.716130 ms]
fast_atan2 (MAE 1e-06): 0.010944 ns per atan2 (26.1% faster) [per invokation: 0.734416 ms]
Success!
These get gradually slower, as you would expect for tighter error bounds, and are about 20 times faster than Vulkan (or 40 times against the worst-case outliers).
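(For reference, the per-invocation number is just the per-atan time multiplied by the number of arctangents computed per realize; dividing 11.6 ms by 0.173 ns gives roughly 6.7e7, i.e. 256 × 256 outputs × 1024 arctans each, which matches the scale factor used in the test further down:

0.173 ns × 6.71e7 ≈ 11.6 ms per invocation
0.346 ns × 6.71e7 ≈ 23.2 ms per invocation

so the slow Vulkan runs are almost exactly two "normal" invocations long.)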
I'm even starting to think Vulkan is waiting on vsync or something...
Hmm, perf shows calls to _atanf... maybe I'm not even using the GPU. One CPU thread goes to 100%, and nvidia-smi doesn't show any activity, yet HL_DEBUG_CODEGEN=1 shows that codegen does produce SPIR-V... I'm puzzled.
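One sanity check I could add on the JIT side (a minimal sketch, assuming Halide::Buffer forwards Runtime::Buffer::has_device_allocation(), and reusing the Func and sizes from the performance test): realize into an explicit buffer and see whether it actually picks up a device allocation. If it never does, the GPU branch of the schedule was not taken and the host did the work, which would explain _atanf showing up in perf.

// Hypothetical probe, not part of the test: check whether the realized
// output actually lives on the device after a GPU-scheduled realize.
Buffer<float> probe(test_w, test_h);
atan_ref.realize(probe);
probe.device_sync();
printf("probe has device allocation: %s\n",
       probe.has_device_allocation() ? "yes" : "no");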
Testing this on main, using the existing atan methods with this trimmed-down version of your performance test:
#include "Halide.h"
#include "halide_benchmark.h"
#ifndef M_PI
#define M_PI 3.14159265358979310000
#endif
using namespace Halide;
using namespace Halide::Tools;
int main(int argc, char **argv) {
Target target = get_jit_target_from_environment();
if (target.arch == Target::WebAssembly) {
printf("[SKIP] Performance tests are meaningless and/or misleading under WebAssembly interpreter.\n");
return 0;
}
if (target.has_feature(Target::WebGPU)) {
printf("[SKIP] WebGPU seems to perform bad, and fast_atan is not really faster in all scenarios.\n");
return 0;
}
Var x, y;
const int test_w = 256;
const int test_h = 256;
Expr t0 = x / float(test_w);
Expr t1 = y / float(test_h);
// To make sure we time mostely the computation of the arctan, and not memory bandwidth,
// we will compute many arctans per output and sum them. In my testing, GPUs suffer more
// from bandwith with this test, so we give it more arctangenses to compute per output.
const int test_d = target.has_gpu_feature() ? 1024 : 64;
RDom rdom{0, test_d};
Expr off = rdom / float(test_d) - 0.5f;
float range = -10.0f;
Func atan_ref{"atan_ref"}, atan2_ref{"atan2_ref"};
atan_ref(x, y) = sum(atan(-range * t0 + (1 - t0) * range + off));
atan2_ref(x, y) = sum(atan2(-range * t0 + (1 - t0) * range + off, -range * t1 + (1 - t1) * range));
Var xo, xi;
Var yo, yi;
if (target.has_gpu_feature()) {
atan_ref.never_partition_all();
atan2_ref.never_partition_all();
atan_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
atan2_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
} else {
atan_ref.vectorize(x, 8);
atan2_ref.vectorize(x, 8);
}
Tools::BenchmarkConfig cfg = {0.2, 1.0};
double scale = 1e9 / (double(test_w) * (test_h * test_d));
// clang-format off
double t_atan = scale * benchmark([&]() { atan_ref.realize({test_w, test_h}); }, cfg);
double t_atan2 = scale * benchmark([&]() { atan2_ref.realize({test_w, test_h}); }, cfg);
// clang-format on
printf(" atan: %f ns per atan\n", t_atan);
printf(" atan2: %f ns per atan2\n", t_atan2);
printf("Success!\n");
return 0;
}
> HL_SPIRV_DUMP_FILE=atan.spirv HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan
> spirv-dis atan.spirv
; SPIR-V
; Version: 1.2
; Generator: Khronos; 0
; Bound: 107
; Schema: 0
OpCapability Shader
%53 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint GLCompute %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y" %k1_LocalInvocationId %k1_WorkgroupId
OpExecutionMode %_kernel_atan2_ref_s0_v1_v9_block_id_y LocalSize 16 16 1
OpName %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y"
OpName %k1_LocalInvocationId "k1_LocalInvocationId"
OpName %k1_WorkgroupId "k1_WorkgroupId"
OpName %k1_args_struct "k1_args_struct"
OpName %k1_args_var "k1_args_var"
OpName %k1_buffer_block1 "k1_buffer_block1"
OpName %k1_atan2_ref "k1_atan2_ref"
OpName %k1_sum_1_0 "k1_sum$1.0"
OpName %k1_loop_idx_1 "k1_loop_idx$1"
OpDecorate %k1_LocalInvocationId BuiltIn LocalInvocationId
OpDecorate %k1_WorkgroupId BuiltIn WorkgroupId
OpMemberDecorate %k1_args_struct 0 Offset 0
OpMemberDecorate %k1_args_struct 1 Offset 4
OpMemberDecorate %k1_args_struct 2 Offset 8
OpMemberDecorate %k1_args_struct 3 Offset 12
OpMemberDecorate %k1_args_struct 4 Offset 16
OpMemberDecorate %k1_args_struct 5 Offset 20
OpDecorate %k1_args_struct Block
OpDecorate %k1_args_var DescriptorSet 0
OpDecorate %k1_args_var Binding 0
OpDecorate %_runtimearr_float ArrayStride 4
OpDecorate %k1_buffer_block1 BufferBlock
OpMemberDecorate %k1_buffer_block1 0 Offset 0
OpDecorate %k1_atan2_ref DescriptorSet 0
OpDecorate %k1_atan2_ref Binding 1
%void = OpTypeVoid
%4 = OpTypeFunction %void
%uint = OpTypeInt 32 0
%v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
%int = OpTypeInt 32 1
%k1_args_struct = OpTypeStruct %int %int %int %int %int %int
%_ptr_Uniform_k1_args_struct = OpTypePointer Uniform %k1_args_struct
%_ptr_Uniform_int = OpTypePointer Uniform %int
%float = OpTypeFloat 32
%_runtimearr_float = OpTypeRuntimeArray %float
%k1_buffer_block1 = OpTypeStruct %_runtimearr_float
%_ptr_Uniform_k1_buffer_block1 = OpTypePointer Uniform %k1_buffer_block1
%_ptr_Function_float = OpTypePointer Function %float
%_ptr_Function_int = OpTypePointer Function %int
%bool = OpTypeBool
%_ptr_Uniform_float = OpTypePointer Uniform %float
%uint_0 = OpConstant %uint 0
%uint_1 = OpConstant %uint 1
%uint_2 = OpConstant %uint 2
%uint_3 = OpConstant %uint 3
%uint_4 = OpConstant %uint 4
%uint_5 = OpConstant %uint 5
%int_16 = OpConstant %int 16
%int_n16 = OpConstant %int -16
%float_0 = OpConstant %float 0
%int_0 = OpConstant %int 0
%float_80 = OpConstant %float 80
%float_0_078125 = OpConstant %float 0.078125
%float_n10 = OpConstant %float -10
%int_1024 = OpConstant %int 1024
%float_0_0009765625 = OpConstant %float 0.0009765625
%float_n10_5 = OpConstant %float -10.5
%int_1 = OpConstant %int 1
%k1_LocalInvocationId = OpVariable %_ptr_Input_v3uint Input
%k1_WorkgroupId = OpVariable %_ptr_Input_v3uint Input
%k1_args_var = OpVariable %_ptr_Uniform_k1_args_struct Uniform
%k1_atan2_ref = OpVariable %_ptr_Uniform_k1_buffer_block1 Uniform
%_kernel_atan2_ref_s0_v1_v9_block_id_y = OpFunction %void None %4
%5 = OpLabel
%k1_sum_1_0 = OpVariable %_ptr_Function_float Function
%k1_loop_idx_1 = OpVariable %_ptr_Function_int Function
%10 = OpLoad %v3uint %k1_LocalInvocationId None
%12 = OpLoad %v3uint %k1_WorkgroupId None
%19 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_0
%20 = OpLoad %int %19 None
%22 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_1
%23 = OpLoad %int %22 None
%25 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_2
%26 = OpLoad %int %25 None
%28 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_3
%29 = OpLoad %int %28 None
%31 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_4
%32 = OpLoad %int %31 None
%34 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_5
%35 = OpLoad %int %34 None
%41 = OpCompositeExtract %uint %12 1
%42 = OpBitcast %int %41
%43 = OpCompositeExtract %uint %12 0
%44 = OpBitcast %int %43
%45 = OpCompositeExtract %uint %10 1
%46 = OpBitcast %int %45
%47 = OpCompositeExtract %uint %10 0
%48 = OpBitcast %int %47
%50 = OpIMul %int %42 %int_16
%52 = OpIAdd %int %23 %int_n16
%54 = OpExtInst %int %53 SMin %50 %52
%57 = OpIMul %int %44 %int_16
%58 = OpIAdd %int %20 %int_n16
%59 = OpExtInst %int %53 SMin %57 %58
OpStore %k1_sum_1_0 %float_0 None
%62 = OpIAdd %int %26 %59
%63 = OpIAdd %int %62 %48
%64 = OpConvertSToF %float %63
%66 = OpFMul %float %64 %float_80
%67 = OpIAdd %int %29 %54
%68 = OpIAdd %int %67 %46
%69 = OpConvertSToF %float %68
%71 = OpFMul %float %69 %float_0_078125
%73 = OpFAdd %float %71 %float_n10
%76 = OpIAdd %int %int_0 %int_1024
OpStore %k1_loop_idx_1 %int_0 None
OpBranch %78
%78 = OpLabel
OpLoopMerge %82 %81 DontUnroll
OpBranch %79
%79 = OpLabel
%83 = OpLoad %int %k1_loop_idx_1 None
%85 = OpULessThan %bool %83 %76
OpBranchConditional %85 %80 %82
%80 = OpLabel
%86 = OpConvertSToF %float %83
%87 = OpFAdd %float %66 %86
%89 = OpFMul %float %87 %float_0_0009765625
%91 = OpFAdd %float %89 %float_n10_5
%92 = OpExtInst %float %53 Atan2 %91 %73
%93 = OpLoad %float %k1_sum_1_0 None
%94 = OpFAdd %float %92 %93
OpStore %k1_sum_1_0 %94 None
OpBranch %81
%81 = OpLabel
%97 = OpLoad %int %k1_loop_idx_1 None
%95 = OpIAdd %int %97 %int_1
OpStore %k1_loop_idx_1 %95 None
OpBranch %78
%82 = OpLabel
%98 = OpLoad %float %k1_sum_1_0 None
%99 = OpIAdd %int %29 %54
%100 = OpIAdd %int %99 %46
%101 = OpIMul %int %100 %32
%102 = OpIAdd %int %59 %35
%103 = OpIAdd %int %101 %102
%104 = OpIAdd %int %103 %48
%106 = OpInBoundsAccessChain %_ptr_Uniform_float %k1_atan2_ref %uint_0 %104
OpStore %106 %98 None
OpReturn
OpFunctionEnd
So the current atan is getting mapped to the native SPIR-V atan instruction (see the GLSL.std.450 Atan2 at %92).
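For reference, here is my reading of what the kernel body above computes per output pixel (ids %66 through %106), written out as plain C++. This is only an interpretation of the disassembly, not generated code, and the helper name is made up:

#include <cmath>

// Interpretation of the SPIR-V kernel body above (hypothetical helper name).
// x and y are the output coordinates; the constants come straight from the
// disassembly and match the folded Halide expression
// atan2(20*t0 - 10 + (r/1024 - 0.5), 20*t1 - 10) with t0 = x/256, t1 = y/256.
float atan2_ref_pixel(int x, int y) {
    float b = 0.078125f * y - 10.0f;                        // %71, %73 (loop-invariant)
    float sum = 0.0f;                                       // %k1_sum$1.0
    for (int r = 0; r < 1024; r++) {                        // %k1_loop_idx$1
        float a = (80.0f * x + r) * 0.0009765625f - 10.5f;  // %87, %89, %91
        sum += std::atan2(a, b);                            // %92: GLSL.std.450 Atan2
    }
    return sum;                                             // stored to the output buffer at %106
}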
Running this, I'm getting the following on an NVIDIA RTX 3070 Ti ...
> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan
atan: 0.077941 ns per atan
atan2: 0.081444 ns per atan2
Success!
And for CUDA ...
> HL_JIT_TARGET="host-cuda" ./build/test/performance/performance_fast_atan
atan: 0.005477 ns per atan
atan2: 0.007064 ns per atan2
Success!
However, the test calls realize({test_w, test_h}), which JIT-compiles and caches the pipeline on the first call, and also allocates and caches the output buffer, so the overhead is significant for this type of test.
If I change the benchmarking code to compile first, reuse existing buffer allocations, and sync the device inside the loop (so each iteration waits for the GPU work to finish rather than just enqueue it), like so ...
    ...
    atan_ref.compile_jit();
    atan2_ref.compile_jit();

    Buffer<float> atan_out(test_w, test_h);
    Buffer<float> atan2_out(test_w, test_h);

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double scale = 1e9 / (double(test_w) * (test_h * test_d));
    // clang-format off
    double t_atan = scale * benchmark([&]() { atan_ref.realize(atan_out); atan_out.device_sync(); }, cfg);
    double t_atan2 = scale * benchmark([&]() { atan2_ref.realize(atan2_out); atan2_out.device_sync(); }, cfg);
    // clang-format on
    ...
The runtimes are much closer:
> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan
atan: 0.004023 ns per atan
atan2: 0.007173 ns per atan2
Success!
> HL_JIT_TARGET="x86-64-linux-tune_znver3-avx-avx2-f16c-fma-sse41-cuda" ./build/test/performance/performance_fast_atan
atan: 0.005034 ns per atan
atan2: 0.006537 ns per atan2
Success!
Thanks a lot, will update the benchmark. Perhaps this fixes the WebGPU slowness as well...
Specifically, reduce the number of wait calls, and remove any potential bottlenecks in the kernel submission. More importantly ... the performance_async_gpu test should pass!
Overall performance should be on par with other GPU backends like OpenCL, Metal, CUDA, etc.