halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

[vulkan] Improve overall performance #7202

Open derek-gerstmann opened 1 year ago

derek-gerstmann commented 1 year ago

Specifically, reduce the number of wait calls, and remove any potential bottlenecks in the kernel submission. More importantly ... the performance_async_gpu test should pass!

Overall performance should be on par with other GPU backends like OpenCL, Metal, CUDA, etc.
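
For context, here is a minimal sketch of the general pattern being asked for (illustrative only, not Halide's actual Vulkan runtime code; the device, queue, and pre-recorded command buffers are assumed to already exist): batch the dispatches into a single queue submission and block on one fence, instead of waiting on the queue after every kernel launch.

#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

// Illustrative sketch: submit a batch of pre-recorded dispatch command buffers
// with a single vkQueueSubmit and a single host-side fence wait, rather than
// one wait per kernel.
void submit_batch_and_wait(VkDevice device, VkQueue queue,
                           const std::vector<VkCommandBuffer> &cmd_bufs) {
    // One fence guards the whole batch.
    VkFenceCreateInfo fence_info = {VK_STRUCTURE_TYPE_FENCE_CREATE_INFO};
    VkFence fence = VK_NULL_HANDLE;
    vkCreateFence(device, &fence_info, nullptr, &fence);

    VkSubmitInfo submit = {VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submit.commandBufferCount = static_cast<uint32_t>(cmd_bufs.size());
    submit.pCommandBuffers = cmd_bufs.data();

    // Single submission + single wait for the entire batch, instead of a
    // vkQueueWaitIdle() (or per-fence wait) after every kernel launch.
    vkQueueSubmit(queue, 1, &submit, fence);
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    vkDestroyFence(device, fence, nullptr);
}

A real runtime would additionally have to keep buffers and descriptor sets alive until the fence signals; the sketch only shows the submission/wait pattern.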

mcourteaux commented 2 months ago

I got very suspicious while working on performance tests for fast arctan. My test results are highly variable, and they seem to jump in discrete increments:

JIT compiling fast_atan2_4 for x86-64-linux-tune_znver1-avx-avx2-f16c-fma-jit-sse41-user_context-vk_v13-vulkan
                  atan: 0.176901 ns per atan
 fast_atan (MAE 1e-02): 0.173092 ns per atan ( 2.2% faster)  [per invokation: 11.616016 ms]
 fast_atan (MAE 1e-03): 0.172300 ns per atan ( 2.6% faster)  [per invokation: 11.562847 ms]
 fast_atan (MAE 1e-04): 0.172323 ns per atan ( 2.6% faster)  [per invokation: 11.564432 ms]
 fast_atan (MAE 1e-05): 0.172705 ns per atan ( 2.4% faster)  [per invokation: 11.590015 ms]
 fast_atan (MAE 1e-06): 0.173716 ns per atan ( 1.8% faster)  [per invokation: 11.657883 ms]

                  atan2: 0.182086 ns per atan2
 fast_atan2 (MAE 1e-02): 0.174972 ns per atan2 ( 3.9% faster)  [per invokation: 11.742171 ms]
 fast_atan2 (MAE 1e-03): 0.174859 ns per atan2 ( 4.0% faster)  [per invokation: 11.734573 ms]
 fast_atan2 (MAE 1e-04): 0.175999 ns per atan2 ( 3.3% faster)  [per invokation: 11.811114 ms]
 fast_atan2 (MAE 1e-05): 0.176096 ns per atan2 ( 3.3% faster)  [per invokation: 11.817596 ms]
 fast_atan2 (MAE 1e-06): 0.176075 ns per atan2 ( 3.3% faster)  [per invokation: 11.816217 ms]

Another run:

                  atan: 0.176924 ns per atan
 fast_atan (MAE 1e-02): 0.172724 ns per atan ( 2.4% faster)  [per invokation: 11.591305 ms]
 fast_atan (MAE 1e-03): 0.173269 ns per atan ( 2.1% faster)  [per invokation: 11.627858 ms]
 fast_atan (MAE 1e-04): 0.174131 ns per atan ( 1.6% faster)  [per invokation: 11.685726 ms]
 fast_atan (MAE 1e-05): 0.173564 ns per atan ( 1.9% faster)  [per invokation: 11.647658 ms]
 fast_atan (MAE 1e-06): 0.346123 ns per atan (-95.6% faster)  [per invokation: 23.227917 ms]

                  atan2: 0.182132 ns per atan2
 fast_atan2 (MAE 1e-02): 0.175971 ns per atan2 ( 3.4% faster)  [per invokation: 11.809239 ms]
 fast_atan2 (MAE 1e-03): 0.175526 ns per atan2 ( 3.6% faster)  [per invokation: 11.779378 ms]
 fast_atan2 (MAE 1e-04): 0.176735 ns per atan2 ( 3.0% faster)  [per invokation: 11.860472 ms]
 fast_atan2 (MAE 1e-05): 0.177133 ns per atan2 ( 2.7% faster)  [per invokation: 11.887211 ms]
 fast_atan2 (MAE 1e-06): 0.360196 ns per atan2 (-97.8% faster)  [per invokation: 24.172320 ms]

They are all hovering around this 11.7 ms time, and sometimes, when the test doesn't get the expected performance, it lands almost exactly on double that: 24 ms. Compare that to CUDA:

                  atan: 0.014434 ns per atan
 fast_atan (MAE 1e-02): 0.007271 ns per atan (49.6% faster)  [per invokation: 0.487923 ms]
 fast_atan (MAE 1e-03): 0.007490 ns per atan (48.1% faster)  [per invokation: 0.502641 ms]
 fast_atan (MAE 1e-04): 0.007792 ns per atan (46.0% faster)  [per invokation: 0.522928 ms]
 fast_atan (MAE 1e-05): 0.008710 ns per atan (39.7% faster)  [per invokation: 0.584539 ms]
 fast_atan (MAE 1e-06): 0.009016 ns per atan (37.5% faster)  [per invokation: 0.605042 ms]

                  atan2: 0.014800 ns per atan2
 fast_atan2 (MAE 1e-02): 0.009493 ns per atan2 (35.9% faster)  [per invokation: 0.637034 ms]
 fast_atan2 (MAE 1e-03): 0.009774 ns per atan2 (34.0% faster)  [per invokation: 0.655949 ms]
 fast_atan2 (MAE 1e-04): 0.010010 ns per atan2 (32.4% faster)  [per invokation: 0.671784 ms]
 fast_atan2 (MAE 1e-05): 0.010671 ns per atan2 (27.9% faster)  [per invokation: 0.716130 ms]
 fast_atan2 (MAE 1e-06): 0.010944 ns per atan2 (26.1% faster)  [per invokation: 0.734416 ms]
Success!

These get gradually slower as the accuracy requirement tightens, and are about 20 times faster than Vulkan (or 40 times in the case of the worst-case outliers).

I'm even thinking Vulkan is waiting on vsync or something...

mcourteaux commented 2 months ago

Hmm, perf shows calls to _atanf, so maybe I'm not even using the GPU... One CPU thread goes to 100%, and nvidia-smi doesn't show any activity. Yet HL_DEBUG_CODEGEN=1 shows that codegen does in fact produce SPIR-V... I'm puzzled...
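
One quick sanity check (just a sketch using the same Halide JIT APIs that appear elsewhere in this thread) is to print the resolved JIT target and confirm the GPU feature is actually present, since if it isn't, the gpu_tile() branch is never taken and the pipeline silently runs on the CPU:

#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main() {
    // The target Halide will JIT for, driven by HL_JIT_TARGET / host detection.
    Target t = get_jit_target_from_environment();
    printf("JIT target: %s\n", t.to_string().c_str());

    // If these print 0, the GPU schedule is never used, which would be one
    // explanation for a single CPU thread sitting at 100%.
    printf("has_gpu_feature: %d\n", t.has_gpu_feature() ? 1 : 0);
    printf("vulkan feature:  %d\n", t.has_feature(Target::Vulkan) ? 1 : 0);
    return 0;
}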

derek-gerstmann commented 2 months ago

Testing this on main, using the existing atan methods with this trimmed down version of your performance test:

#include "Halide.h"
#include "halide_benchmark.h"

#ifndef M_PI
#define M_PI 3.14159265358979310000
#endif

using namespace Halide;
using namespace Halide::Tools;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();
    if (target.arch == Target::WebAssembly) {
        printf("[SKIP] Performance tests are meaningless and/or misleading under WebAssembly interpreter.\n");
        return 0;
    }
    if (target.has_feature(Target::WebGPU)) {
        printf("[SKIP] WebGPU seems to perform bad, and fast_atan is not really faster in all scenarios.\n");
        return 0;
    }

    Var x, y;
    const int test_w = 256;
    const int test_h = 256;

    Expr t0 = x / float(test_w);
    Expr t1 = y / float(test_h);

    // To make sure we time mostly the computation of the arctan, and not memory bandwidth,
    // we will compute many arctans per output and sum them. In my testing, GPUs suffer more
    // from bandwidth with this test, so we give them more arctangents to compute per output.

    const int test_d = target.has_gpu_feature() ? 1024 : 64;
    RDom rdom{0, test_d};
    Expr off = rdom / float(test_d) - 0.5f;

    float range = -10.0f;
    Func atan_ref{"atan_ref"}, atan2_ref{"atan2_ref"};
    atan_ref(x, y) = sum(atan(-range * t0 + (1 - t0) * range + off));
    atan2_ref(x, y) = sum(atan2(-range * t0 + (1 - t0) * range + off, -range * t1 + (1 - t1) * range));

    Var xo, xi;
    Var yo, yi;
    if (target.has_gpu_feature()) {
        atan_ref.never_partition_all();
        atan2_ref.never_partition_all();
        atan_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
        atan2_ref.gpu_tile(x, y, xo, yo, xi, yi, 16, 16, TailStrategy::ShiftInwards);
    } else {
        atan_ref.vectorize(x, 8);
        atan2_ref.vectorize(x, 8);
    }

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double scale = 1e9 / (double(test_w) * (test_h * test_d));

    // clang-format off
    double t_atan  = scale * benchmark([&]() {  atan_ref.realize({test_w, test_h}); }, cfg);
    double t_atan2 = scale * benchmark([&]() { atan2_ref.realize({test_w, test_h}); }, cfg);
    // clang-format on

    printf("                  atan: %f ns per atan\n", t_atan);
    printf("                  atan2: %f ns per atan2\n", t_atan2);
    printf("Success!\n");
    return 0;
}

derek-gerstmann commented 2 months ago

> HL_SPIRV_DUMP_FILE=atan.spirv HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan
> spirv-dis atan.spirv
; SPIR-V
; Version: 1.2
; Generator: Khronos; 0
; Bound: 107
; Schema: 0
               OpCapability Shader
         %53 = OpExtInstImport "GLSL.std.450"
               OpMemoryModel Logical GLSL450
               OpEntryPoint GLCompute %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y" %k1_LocalInvocationId %k1_WorkgroupId
               OpExecutionMode %_kernel_atan2_ref_s0_v1_v9_block_id_y LocalSize 16 16 1
               OpName %_kernel_atan2_ref_s0_v1_v9_block_id_y "_kernel_atan2_ref_s0_v1_v9_block_id_y"
               OpName %k1_LocalInvocationId "k1_LocalInvocationId"
               OpName %k1_WorkgroupId "k1_WorkgroupId"
               OpName %k1_args_struct "k1_args_struct"
               OpName %k1_args_var "k1_args_var"
               OpName %k1_buffer_block1 "k1_buffer_block1"
               OpName %k1_atan2_ref "k1_atan2_ref"
               OpName %k1_sum_1_0 "k1_sum$1.0"
               OpName %k1_loop_idx_1 "k1_loop_idx$1"
               OpDecorate %k1_LocalInvocationId BuiltIn LocalInvocationId
               OpDecorate %k1_WorkgroupId BuiltIn WorkgroupId
               OpMemberDecorate %k1_args_struct 0 Offset 0
               OpMemberDecorate %k1_args_struct 1 Offset 4
               OpMemberDecorate %k1_args_struct 2 Offset 8
               OpMemberDecorate %k1_args_struct 3 Offset 12
               OpMemberDecorate %k1_args_struct 4 Offset 16
               OpMemberDecorate %k1_args_struct 5 Offset 20
               OpDecorate %k1_args_struct Block
               OpDecorate %k1_args_var DescriptorSet 0
               OpDecorate %k1_args_var Binding 0
               OpDecorate %_runtimearr_float ArrayStride 4
               OpDecorate %k1_buffer_block1 BufferBlock
               OpMemberDecorate %k1_buffer_block1 0 Offset 0
               OpDecorate %k1_atan2_ref DescriptorSet 0
               OpDecorate %k1_atan2_ref Binding 1
       %void = OpTypeVoid
          %4 = OpTypeFunction %void
       %uint = OpTypeInt 32 0
     %v3uint = OpTypeVector %uint 3
%_ptr_Input_v3uint = OpTypePointer Input %v3uint
        %int = OpTypeInt 32 1
%k1_args_struct = OpTypeStruct %int %int %int %int %int %int
%_ptr_Uniform_k1_args_struct = OpTypePointer Uniform %k1_args_struct
%_ptr_Uniform_int = OpTypePointer Uniform %int
      %float = OpTypeFloat 32
%_runtimearr_float = OpTypeRuntimeArray %float
%k1_buffer_block1 = OpTypeStruct %_runtimearr_float
%_ptr_Uniform_k1_buffer_block1 = OpTypePointer Uniform %k1_buffer_block1
%_ptr_Function_float = OpTypePointer Function %float
%_ptr_Function_int = OpTypePointer Function %int
       %bool = OpTypeBool
%_ptr_Uniform_float = OpTypePointer Uniform %float
     %uint_0 = OpConstant %uint 0
     %uint_1 = OpConstant %uint 1
     %uint_2 = OpConstant %uint 2
     %uint_3 = OpConstant %uint 3
     %uint_4 = OpConstant %uint 4
     %uint_5 = OpConstant %uint 5
     %int_16 = OpConstant %int 16
    %int_n16 = OpConstant %int -16
    %float_0 = OpConstant %float 0
      %int_0 = OpConstant %int 0
   %float_80 = OpConstant %float 80
%float_0_078125 = OpConstant %float 0.078125
  %float_n10 = OpConstant %float -10
   %int_1024 = OpConstant %int 1024
%float_0_0009765625 = OpConstant %float 0.0009765625
%float_n10_5 = OpConstant %float -10.5
      %int_1 = OpConstant %int 1
%k1_LocalInvocationId = OpVariable %_ptr_Input_v3uint Input
%k1_WorkgroupId = OpVariable %_ptr_Input_v3uint Input
%k1_args_var = OpVariable %_ptr_Uniform_k1_args_struct Uniform
%k1_atan2_ref = OpVariable %_ptr_Uniform_k1_buffer_block1 Uniform
%_kernel_atan2_ref_s0_v1_v9_block_id_y = OpFunction %void None %4
          %5 = OpLabel
 %k1_sum_1_0 = OpVariable %_ptr_Function_float Function
%k1_loop_idx_1 = OpVariable %_ptr_Function_int Function
         %10 = OpLoad %v3uint %k1_LocalInvocationId None
         %12 = OpLoad %v3uint %k1_WorkgroupId None
         %19 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_0
         %20 = OpLoad %int %19 None
         %22 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_1
         %23 = OpLoad %int %22 None
         %25 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_2
         %26 = OpLoad %int %25 None
         %28 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_3
         %29 = OpLoad %int %28 None
         %31 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_4
         %32 = OpLoad %int %31 None
         %34 = OpInBoundsAccessChain %_ptr_Uniform_int %k1_args_var %uint_5
         %35 = OpLoad %int %34 None
         %41 = OpCompositeExtract %uint %12 1
         %42 = OpBitcast %int %41
         %43 = OpCompositeExtract %uint %12 0
         %44 = OpBitcast %int %43
         %45 = OpCompositeExtract %uint %10 1
         %46 = OpBitcast %int %45
         %47 = OpCompositeExtract %uint %10 0
         %48 = OpBitcast %int %47
         %50 = OpIMul %int %42 %int_16
         %52 = OpIAdd %int %23 %int_n16
         %54 = OpExtInst %int %53 SMin %50 %52
         %57 = OpIMul %int %44 %int_16
         %58 = OpIAdd %int %20 %int_n16
         %59 = OpExtInst %int %53 SMin %57 %58
               OpStore %k1_sum_1_0 %float_0 None
         %62 = OpIAdd %int %26 %59
         %63 = OpIAdd %int %62 %48
         %64 = OpConvertSToF %float %63
         %66 = OpFMul %float %64 %float_80
         %67 = OpIAdd %int %29 %54
         %68 = OpIAdd %int %67 %46
         %69 = OpConvertSToF %float %68
         %71 = OpFMul %float %69 %float_0_078125
         %73 = OpFAdd %float %71 %float_n10
         %76 = OpIAdd %int %int_0 %int_1024
               OpStore %k1_loop_idx_1 %int_0 None
               OpBranch %78
         %78 = OpLabel
               OpLoopMerge %82 %81 DontUnroll
               OpBranch %79
         %79 = OpLabel
         %83 = OpLoad %int %k1_loop_idx_1 None
         %85 = OpULessThan %bool %83 %76
               OpBranchConditional %85 %80 %82
         %80 = OpLabel
         %86 = OpConvertSToF %float %83
         %87 = OpFAdd %float %66 %86
         %89 = OpFMul %float %87 %float_0_0009765625
         %91 = OpFAdd %float %89 %float_n10_5
         %92 = OpExtInst %float %53 Atan2 %91 %73
         %93 = OpLoad %float %k1_sum_1_0 None
         %94 = OpFAdd %float %92 %93
               OpStore %k1_sum_1_0 %94 None
               OpBranch %81
         %81 = OpLabel
         %97 = OpLoad %int %k1_loop_idx_1 None
         %95 = OpIAdd %int %97 %int_1
               OpStore %k1_loop_idx_1 %95 None
               OpBranch %78
         %82 = OpLabel
         %98 = OpLoad %float %k1_sum_1_0 None
         %99 = OpIAdd %int %29 %54
        %100 = OpIAdd %int %99 %46
        %101 = OpIMul %int %100 %32
        %102 = OpIAdd %int %59 %35
        %103 = OpIAdd %int %101 %102
        %104 = OpIAdd %int %103 %48
        %106 = OpInBoundsAccessChain %_ptr_Uniform_float %k1_atan2_ref %uint_0 %104
               OpStore %106 %98 None
               OpReturn
               OpFunctionEnd

So the current atan2 is getting mapped to the native SPIR-V Atan2 instruction from the GLSL.std.450 extended instruction set (see %92).

derek-gerstmann commented 2 months ago

Running this I'm getting the following on a NVIDIA RTX 3070 Ti ...

> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan 
                  atan: 0.077941 ns per atan
                  atan2: 0.081444 ns per atan2
Success!

And for Cuda ...

> HL_JIT_TARGET="host-cuda" ./build/test/performance/performance_fast_atan 
                  atan: 0.005477 ns per atan
                  atan2: 0.007064 ns per atan2
Success!

However, the test is calling realize({test_w, test_h}), which will compile and cache the pipeline on the first call, and allocate and cache the output buffer. So the per-invocation overhead is significant for this type of test.

derek-gerstmann commented 2 months ago

If I change the benchmarking code to compile first, reuse existing buffer allocations, and sync the device inside the loop like so ...

...

    atan_ref.compile_jit();
    atan2_ref.compile_jit();

    Buffer<float> atan_out(test_w, test_h);
    Buffer<float> atan2_out(test_w, test_h);

    Tools::BenchmarkConfig cfg = {0.2, 1.0};
    double scale = 1e9 / (double(test_w) * (test_h * test_d));

    // clang-format off
    double t_atan  = scale * benchmark([&]() {  atan_ref.realize(atan_out); atan_out.device_sync(); }, cfg);
    double t_atan2 = scale * benchmark([&]() { atan2_ref.realize(atan2_out); atan2_out.device_sync(); }, cfg);
    // clang-format on
...

The runtimes are much closer:

> HL_JIT_TARGET="host-vulkan-vk_int8-vk_int16-vk_int64-vk_float16-vk_float64-vk_v13" ./build/test/performance/performance_fast_atan 
                  atan: 0.004023 ns per atan
                  atan2: 0.007173 ns per atan2
Success!
> HL_JIT_TARGET="x86-64-linux-tune_znver3-avx-avx2-f16c-fma-sse41-cuda" ./build/test/performance/performance_fast_atan 
                  atan: 0.005034 ns per atan
                  atan2: 0.006537 ns per atan2
Success!

mcourteaux commented 2 months ago

Thanks a lot, will update the benchmark. Perhaps this fixes the WebGPU slowness as well...