iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Is there a way to remove the cudaMalloc overhead in benchmark? #17012

Open Pzzzzz5142 opened 5 months ago

Pzzzzz5142 commented 5 months ago

Request description

When I run a matmul benchmark on CUDA, IREE reports unreasonably bad performance. According to Nsight Systems, IREE has pretty good kernel performance, but because the measurement includes the cudaMalloc overhead, the reported perf is really bad.

[screenshot]

IREE result:

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time             0.679 ms        0.702 ms          967 items_per_second=1.47236k/s
BM_forward/process_time/real_time             0.692 ms        0.724 ms          967 items_per_second=1.44408k/s
BM_forward/process_time/real_time             0.719 ms        0.748 ms          967 items_per_second=1.39038k/s
BM_forward/process_time/real_time             0.800 ms        0.765 ms          967 items_per_second=1.25066k/s
BM_forward/process_time/real_time             0.648 ms        0.675 ms          967 items_per_second=1.54436k/s
BM_forward/process_time/real_time             0.733 ms        0.760 ms          967 items_per_second=1.36352k/s
BM_forward/process_time/real_time             0.716 ms        0.739 ms          967 items_per_second=1.39665k/s
BM_forward/process_time/real_time             0.658 ms        0.678 ms          967 items_per_second=1.52045k/s
BM_forward/process_time/real_time             0.698 ms        0.727 ms          967 items_per_second=1.43309k/s
BM_forward/process_time/real_time             0.775 ms        0.795 ms          967 items_per_second=1.2903k/s
BM_forward/process_time/real_time_mean        0.712 ms        0.731 ms           10 items_per_second=1.41059k/s
BM_forward/process_time/real_time_median      0.707 ms        0.733 ms           10 items_per_second=1.41487k/s
BM_forward/process_time/real_time_stddev      0.048 ms        0.038 ms           10 items_per_second=93.2595/s
BM_forward/process_time/real_time_cv           6.78 %          5.26 %            10 items_per_second=6.61%

I'm wondering whether it is possible to remove the cudaMalloc overhead?

What component(s) does this issue relate to?

Other

Additional context

No response

ScottTodd commented 5 months ago

What are you running specifically for "the benchmark for matmul on CUDA"? A profile/trace (or larger screenshot from nsight) would help qualify what you are measuring.

One thing you can try is running with --device_allocator=caching. That caches allocations between runs, which is useful for static workloads during benchmarking.
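
For example, appended to a typical invocation (module/function names here are placeholders):

iree-benchmark-module --module=model.vmfb --device=cuda://0 --function=forward --device_allocator=caching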

Pzzzzz5142 commented 5 months ago

> What are you running specifically for "the benchmark for matmul on CUDA"?

I'm trying to measure the bare matmul performance of IREE, so I wrote a PyTorch model that contains only an nn.Linear and exported it to the TOSA dialect using torch-mlir (roughly as sketched below).
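
A rough sketch of that export (not the exact script; the torch-mlir entry point name varies between versions, and this assumes the TorchScript-based torch_mlir.compile API with TOSA output):

# Hypothetical sketch of the export described above; shapes/dtype are taken
# from the linear_64_768_768 example attached later in this thread.
# torch_mlir.compile is the legacy TorchScript-based entry point and may be
# named differently in newer torch-mlir releases.
import torch
import torch_mlir

class Linear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(768, 768, bias=False)

    def forward(self, x):
        return self.linear(x)

model = Linear().half().eval()
example_input = torch.randn(1, 64, 768, dtype=torch.float16)

# Lower to the TOSA dialect and write out the .mlir file fed to iree-compile.
module = torch_mlir.compile(model, example_input, output_type="tosa")
with open("linear.mlir", "w") as f:
    f.write(str(module))

I then use this command to compile: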

./iree-build/tools/iree-compile --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_86 -o linear linear.mlir

And this command to profile the performance.

./iree-build/tools/iree-benchmark-module --module=linear.vmfb \
    --iree-hal-target-backends=cuda --device=cuda://0 \
    --function=forward \
    --device_allocator=caching \
    --input=1x64x768xf16=-1 --benchmark_repetitions=10

However, adding --device_allocator=caching still produces a similar result.

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time              1.02 ms         1.14 ms          662 items_per_second=978.87/s
BM_forward/process_time/real_time             0.847 ms        0.911 ms          662 items_per_second=1.18089k/s
BM_forward/process_time/real_time             0.795 ms        0.856 ms          662 items_per_second=1.25826k/s
BM_forward/process_time/real_time             0.686 ms        0.761 ms          662 items_per_second=1.4585k/s
BM_forward/process_time/real_time             0.429 ms        0.469 ms          662 items_per_second=2.33169k/s
BM_forward/process_time/real_time             0.215 ms        0.252 ms          662 items_per_second=4.66178k/s
BM_forward/process_time/real_time             0.584 ms        0.628 ms          662 items_per_second=1.71309k/s
BM_forward/process_time/real_time             0.938 ms         1.04 ms          662 items_per_second=1.06586k/s
BM_forward/process_time/real_time             0.517 ms        0.570 ms          662 items_per_second=1.93588k/s
BM_forward/process_time/real_time             0.473 ms        0.521 ms          662 items_per_second=2.11314k/s
BM_forward/process_time/real_time_mean        0.650 ms        0.715 ms           10 items_per_second=1.8698k/s
BM_forward/process_time/real_time_median      0.635 ms        0.694 ms           10 items_per_second=1.5858k/s
BM_forward/process_time/real_time_stddev      0.253 ms        0.277 ms           10 items_per_second=1.08292k/s
BM_forward/process_time/real_time_cv          38.83 %         38.70 %            10 items_per_second=57.92%

It still performs a lot of cuMemAlloc calls.

[screenshot]

Meanwhile, the kernel time averages ~0.0041 ms.

[screenshot]

Although a little launch overhead is acceptable (especially for this small shape), the overhead in the IREE benchmark is quite large. By comparison, PyTorch only has about 0.031 ms of launch overhead with an average kernel time of ~0.0046 ms.

[screenshots]

Can I remove the cudaMalloc completely, or at least exclude it from the perf report so that I can get a more accurate picture of kernel performance?

stellaraccident commented 5 months ago

@benvanik something seems off here. I wouldn't expect such a simple example to be doing intermediate allocs, and if it isn't, then it seems like we might be benchmarking result allocs. If this were being invoked in something like pytorch, we would be using the calling convention for pre-allocated result buffers. I expect that since benchmark-module is not doing that, the allocation is happening "inside" and being measured. Am I thinking about this right? Is there a way with benchmark-module to have it pass a tied result buffer slab to simulate how this would be done in a real integration?

Pzzzzz5142 commented 5 months ago

Thanks for the help! FYI, this zip contains the original MLIR file and the nsys reports for PyTorch and IREE: linear_64_768_768.zip

benvanik commented 5 months ago

iree-benchmark-module precisely matches the behavior of iree-run-module - there's no special behavior there (and won't be). So think about how your program runs and what it requires; whether it goes through iree-run-module or iree-benchmark-module is just the difference between running it once and printing results or running it many times and printing statistics.

Thus, if your program allocates result buffers before returning, iree-benchmark-module will benchmark those allocations - this is what makes it a useful benchmarking tool for what you'd expect if you ran the program outside of benchmarking. iree-benchmark-module performs no additional allocations compared to a normal run, as it's exactly the same.

If you want to avoid the result allocations you can pass in buffers - but this is orthogonal to benchmarking: if you define your program to take output buffers and modify the IR to use them, then both iree-run-module and iree-benchmark-module (and whatever you'd use to run the program from your application) will have identical behavior and require you to pass the output buffers in.

Now all that said, unless you are running one-shot you're going to want the caching allocator and you're going to want it everywhere (benchmarking and not). So if you pass --device_allocator=caching to iree-benchmark-module (and you should) then you also want to pass it to iree-run-module and do the same thing (configure the caching allocator) in your hosting application. That's because we behave consistently across all surfaces and don't have special behavior based on how something is run.

As for how to use output buffers, you have two choices: the easy way, hacking the IR, or the better way, changing your frontend. To hack the IR, for any result you want stored into a provided buffer, add a !hal.buffer (or !hal.buffer_view) function argument with the iree.abi.output attribute specifying which result it is the storage for. So:

util.func @foo(%arg0: tensor<?x8x8x3xf32>, %ret0: !hal.buffer {iree.abi.output = 0 : index}, %ret1: !hal.buffer {iree.abi.output = 1 : index}) ->
    (tensor<?x8x8x3xf32>, tensor<?x8x8x3xf32>) {

Note that you must still return the results - with this in place, what's returned is just a reference to the provided storage with its final shape instead of a newly allocated device buffer - otherwise it's the same.

When calling you can then allocate one buffer per result yourself, or ideally one large buffer and use iree_hal_buffer_subspan to slice it up and pass it in. In the tools you can use the & prefix to indicate that an input is passed by-reference:

$ iree-run-module (or iree-benchmark-module) --input=1x8x8x3xf32=100 --input=&1x8x8x3xf32 --input=&1x8x8x3xf32

You can also initialize the inputs if doing in-place updates by giving them values.

The best way to do it, though, is to set up the ABI from the frontend using the hal.tensor.import/hal.tensor.export ops instead of the attributes. This lets you group multiple results in the same buffers.

stellaraccident commented 5 months ago

Thanks Ben, I was being imprecise when I said "make benchmark-module" do this. What I was actually asking was how to do the output buffer thing in the program and then benchmark that.

I've got it on my list to add this properly to the torch frontend as I need it for some serving cases (and the previous times I've done this it has been ad-hoc).

But as you say, it should still be possible to get approximately what is being asked for with the caching allocator, and I'm wondering if there is a problem here or more things to try.

Pzzzzz5142 commented 5 months ago

Thanks for the really detailed explanation. If I understand correctly, simply using the aforementioned compile command will not move the result allocation into a pre-allocated buffer. Is it possible to reuse such a conversion as a workaround? If not (a naive question), how do I lower the exported TOSA dialect to the IREE dialects so that I can modify the result buffer behavior and then compile it to a vmfb file? Thanks!

stellaraccident commented 5 months ago

Yeah, the pipelines defaulting to generating programs with all results allocated internally is a bias from a long time back -- and it is a lot easier to write with fewer sharp edges. For a lot of big workloads/graphs, the impact is much less than when trying to do a single kernel launch with tight constraints. The tools should work for that, but it isn't the dominant case that folks are generally working on.

The TOSA input pipeline is mostly just taking the defaults, so it should convert func.func to the internal util.func and preserve those ABI attributes. You can just hack it directly.

For example, I changed the func line to this:

  func.func @forward(%arg0: tensor<1x64x768xf16>, %arg1: !hal.buffer {iree.abi.output = 0 : index}) -> tensor<1x64x768xf16> {

Then ran this:

iree-compile --iree-hal-target-backends=llvm-cpu iree_linear_64_768_768.mlir -o /dev/null --mlir-print-ir-after-all --mlir-elide-elementsattrs-if-larger=100 2>&1 | less

And paged down to WrapEntryPointsPass to see how that got converted. There you see this:

  util.func public @forward(%arg0: !hal.buffer_view, %arg1: !hal.buffer) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @forward(%input0: tensor<1x64x768xf16>, %input1: !hal.buffer {iree.abi.output = 0 : index}) -> (%output0: tensor<1x64x768xf16>)"}} {
    %0 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<1x64x768xf16>
    %1 = util.call @_forward(%0, %arg1) : (tensor<1x64x768xf16>, !hal.buffer) -> tensor<1x64x768xf16>
    %2 = hal.tensor.export %1 "output0" into(%arg1 : !hal.buffer) : tensor<1x64x768xf16> -> !hal.buffer_view
    util.return %2 : !hal.buffer_view
  }
  util.func private @_forward(%arg0: tensor<1x64x768xf16>, %arg1: !hal.buffer {iree.abi.output = 0 : index}) -> tensor<1x64x768xf16> {
    %cst = arith.constant 0.000000e+00 : f16
    %cst_0 = arith.constant dense_resource<__elided__> : tensor<768x768xf16>
    %0 = tensor.empty() : tensor<768x768xf16>
    %transposed = linalg.transpose ins(%cst_0 : tensor<768x768xf16>) outs(%0 : tensor<768x768xf16>) permutation = [1, 0] 
    %expanded = tensor.expand_shape %transposed [[0, 1], [2]] : tensor<768x768xf16> into tensor<1x768x768xf16>
    %1 = tensor.empty() : tensor<1x64x768xf16>
    %2 = linalg.fill ins(%cst : f16) outs(%1 : tensor<1x64x768xf16>) -> tensor<1x64x768xf16>
    %3 = linalg.batch_matmul ins(%arg0, %expanded : tensor<1x64x768xf16>, tensor<1x768x768xf16>) outs(%2 : tensor<1x64x768xf16>) -> tensor<1x64x768xf16>
    util.return %3 : tensor<1x64x768xf16>
  }

Note the hal.tensor.export into the provided buffer. This is the behavior Ben was describing.

I haven't actually tried running it, but this looks like the compiled form I would expect. I still think you should be able to get a result that is doing approximately the same work as the Torch version using allocator settings (which is how Torch amortizes its allocations).

stellaraccident commented 5 months ago

While not needed in this case, if you want to hack on internal dialect forms (vs TOSA/frontend), you can use --compile-to. Can help when poking on things that aren't exposed to all frontends. Example:

iree-compile --iree-hal-target-backends=llvm-cpu iree_linear_64_768_768.mlir --compile-to=input --mlir-elide-elementsattrs-if-larger=100

Pzzzzz5142 commented 5 months ago

Great! Thanks for the example! I'll try it myself. Really appreciate the help.

As for this issue, I'll leave it open for further discussion on what IREE can do for output-buffer benchmarking. Feel free to close it.

stellaraccident commented 5 months ago

Let us know your findings. There's too much tribal knowledge on this, and we should at least close issues like this with some docs or a sample. But we'd like to document something that we know works :)

Pzzzzz5142 commented 5 months ago

> Let us know your findings.

Sure👌

benvanik commented 5 months ago

Note that providing output storage is largely incompatible with dynamic shapes, conditional execution, or proper async (which is why we don't do it by default). We do support using less storage (so if you pass in a buffer that could hold tensor<4096xf32> but only produce a tensor<4xf32> we'll just store that and return the buffer view with the shape 4xf32). In async mode the disadvantage is that you preallocate and wire the memory before you begin running instead of letting it be allocated on-demand (possibly reusing input buffers that are dead after being consumed). Basically, output args are a special case that only works in certain situations - those may be always for certain users (static shapes only, already existing persistent storage, synchronous execution or their own scheduling) but not something that's a safe or usable default in the compiler. But users/frontends can have different defaults if they want.

Pzzzzz5142 commented 5 months ago

Sorry for the late reply; it took some time to physically fix the downed server. Using the provided steps, I can observe that there is no more cudaMalloc overhead. That is:

  1. Add the output buffer as function argument
  2. Run with --device_allocator=caching --input=1x64x768xf16=-1 --input="&1x64x768xf16"

The modification to the MLIR file is so simple that I just wrote a Python script that only modifies the function arguments (roughly as sketched below), and it works fine. So I'm comfortable with this approach.
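
A minimal sketch of that kind of rewrite (not the exact script; it assumes a single static-shaped result and a func.func @forward signature like the one shown earlier in the thread):

# Sketch: append an output-buffer argument carrying the iree.abi.output
# attribute to the @forward signature of the exported MLIR file.
# File names and the single-result assumption are illustrative only.
import re

with open("linear.mlir") as f:
    mlir = f.read()

# Turns
#   func.func @forward(%arg0: tensor<1x64x768xf16>) -> tensor<1x64x768xf16>
# into
#   func.func @forward(%arg0: tensor<1x64x768xf16>,
#                      %ret0: !hal.buffer {iree.abi.output = 0 : index}) -> tensor<1x64x768xf16>
mlir = re.sub(
    r"(func\.func @forward\([^)]*)\)",
    r"\1, %ret0: !hal.buffer {iree.abi.output = 0 : index})",
    mlir,
    count=1,
)

with open("linear_outbuf.mlir", "w") as f:
    f.write(mlir)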

And the perf result now basically reflects the real kernel performance, though there is still a little overhead which seems to be introduced by the benchmark itself, since there is an event sync after every kernel. This prevents the CPU and GPU from executing asynchronously, but it seems it can be ignored as the problem/model size gets large.

[screenshot]

FYI the nsys report: iree_small_ref_result.nsys-rep.zip

---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time             0.036 ms        0.058 ms        15836 items_per_second=27.4973k/s
BM_forward/process_time/real_time             0.039 ms        0.066 ms        15836 items_per_second=25.5443k/s
BM_forward/process_time/real_time             0.038 ms        0.062 ms        15836 items_per_second=26.56k/s
BM_forward/process_time/real_time             0.036 ms        0.062 ms        15836 items_per_second=27.9651k/s
BM_forward/process_time/real_time             0.037 ms        0.057 ms        15836 items_per_second=27.0625k/s
BM_forward/process_time/real_time             0.038 ms        0.059 ms        15836 items_per_second=26.3626k/s
BM_forward/process_time/real_time             0.034 ms        0.059 ms        15836 items_per_second=29.1757k/s
BM_forward/process_time/real_time             0.038 ms        0.059 ms        15836 items_per_second=26.2813k/s
BM_forward/process_time/real_time             0.036 ms        0.056 ms        15836 items_per_second=27.8154k/s
BM_forward/process_time/real_time             0.038 ms        0.061 ms        15836 items_per_second=26.4087k/s
BM_forward/process_time/real_time_mean        0.037 ms        0.060 ms           10 items_per_second=27.0673k/s
BM_forward/process_time/real_time_median      0.037 ms        0.059 ms           10 items_per_second=26.8112k/s
BM_forward/process_time/real_time_stddev      0.001 ms        0.003 ms           10 items_per_second=1.06046k/s
BM_forward/process_time/real_time_cv           3.85 %          5.06 %            10 items_per_second=3.92%

Thanks for the help!

stellaraccident commented 5 months ago

Nice!

Not related to the issue you were having, but if trying to benchmark for real, you should generate real inputs (from a random or other source with entropy). The tools take npy files for input and we usually generate them in python and pass them in.

The reason is that on a lot of modern hardware, moving repeating patterns of bits across memory fabrics can be a lot more efficient than moving "real" values. This can result in parts of the machine experiencing different power utilization and can throw performance measurements off. It's a good practice to always test with data of a similar distribution as the real thing.
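
For example, something along these lines (a sketch; the file name is just a placeholder, and if I recall the flag form correctly the npy file can then be passed with --input=@input0.npy):

# Sketch: generate a random f16 input matching the 1x64x768 shape from this
# thread and save it as .npy so it can be passed to the tools
# (e.g. --input=@input0.npy; file name is a placeholder).
import numpy as np

x = np.random.randn(1, 64, 768).astype(np.float16)
np.save("input0.npy", x)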

stellaraccident commented 5 months ago

> which seems to be introduced by the benchmark itself

By default the compiler uses a synchronous ABI, which puts the sync "inside" the workload. The reason is historical, but suffice it to say that it is a lot harder to screw up, and the convention held.

There is an alternative ABI ("coarse-fences") which externalizes the synchronization handles so they are passed in to the function and allows the caller to do arbitrary synchronization. That is used for any high performance integrations, but it is harder to use. The tooling will detect it and do the right thing, but I think it will still do a wait inside the benchmark loop to ensure the pipeline completed.

When coming from torch, the compiler builds both ABIs and makes the async one available with a special function name suffix. But we haven't done that work on other input pipelines. There is a generic way to do it, but as Ben says, it isn't compatible with this abi.output thing you are doing right now.

If you can get what you need without getting into the exciting world of async benchmarking, I'd recommend not going down this rabbit hole until you need to since there are things that don't quite connect up on all of the paths. We use these modes for production cases, but the simpler stuff is easier to iterate on.

Pzzzzz5142 commented 5 months ago

Thanks for the explanation on all of my questions.

> The reason is that on a lot of modern hardware, moving repeating patterns of bits across memory fabrics can be a lot more efficient than moving "real" values.

Though the mechanism is straightforward, it surprises me that this effect shows up in real-world cases.

As for the benchmark part, I see the burden of async benchmarking. For now, I'm satisfied with the current status, so I'll take the advice not to dwell on this case.

Also, thanks again for the detailed answers to these questions! 🚀

benvanik commented 5 months ago

You want to run way more than a single dispatch at a time - otherwise you are indeed just measuring cold start time and synchronization. For example, our dispatch benchmarks have a batch size argument that lets the benchmarking tool specify how many to run at a time and averages the time across them. You could just dispatch the same thing N times and pass --batch_size=N to have the tool do the math for you.
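
For example, reusing the command from earlier in this thread (hypothetical: it assumes the module were rebuilt so that forward performs the matmul 100 times per invocation):

./iree-build/tools/iree-benchmark-module --module=linear.vmfb --device=cuda://0 --function=forward --device_allocator=caching --input=1x64x768xf16=-1 --input="&1x64x768xf16" --batch_size=100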

But the key thing is to not try to benchmark single dispatches.