Benchmarks WGPU: Add benchmarks for reduce operations #584

mmalczak commented 1 year ago

Feature description

Add benchmarks for reduce operations.

Reduce one dimension:

Sum dim
Mean dim
Argmax
Argmin

Reduce full tensor to a scalar:

Sum
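
To make the expected outputs of these operations concrete, here is a minimal CPU reference sketch that a benchmark could use to sanity-check results. It is not burn's implementation: the function names, the row-major `[rows, cols]` layout, reduction along the last dimension, and the "first index wins on ties" rule are all assumptions made for illustration.

```rust
// Hypothetical CPU reference reductions (illustration only, not burn's API).
// `data` is a row-major [rows, cols] tensor; reductions run along the last
// dimension. `cols` must be non-zero, otherwise `chunks_exact` panics.

/// Sum each row, producing one value per row.
pub fn sum_dim(data: &[f32], cols: usize) -> Vec<f32> {
    data.chunks_exact(cols).map(|row| row.iter().sum()).collect()
}

/// Mean of each row, producing one value per row.
pub fn mean_dim(data: &[f32], cols: usize) -> Vec<f32> {
    sum_dim(data, cols)
        .into_iter()
        .map(|s| s / cols as f32)
        .collect()
}

/// Index of the largest element in each row (assumption: first index wins on ties).
pub fn argmax_dim(data: &[f32], cols: usize) -> Vec<usize> {
    data.chunks_exact(cols)
        .map(|row| {
            let mut best = 0;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i;
                }
            }
            best
        })
        .collect()
}

/// Index of the smallest element in each row (assumption: first index wins on ties).
pub fn argmin_dim(data: &[f32], cols: usize) -> Vec<usize> {
    data.chunks_exact(cols)
        .map(|row| {
            let mut best = 0;
            for (i, &v) in row.iter().enumerate() {
                if v < row[best] {
                    best = i;
                }
            }
            best
        })
        .collect()
}

/// Reduce the full tensor to a scalar.
pub fn sum(data: &[f32]) -> f32 {
    data.iter().sum()
}
```

For a `[2, 3]` input `[1, 5, 2, 4, 0, 3]`, `sum_dim` gives one value per row and `sum` collapses everything to a single scalar, which is the distinction between the two groups listed above.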

There is an open issue to improve the performance of reduce kernels (https://github.com/burn-rs/burn/issues/536). Before starting work on performance, we need proper benchmarks.

vini-fda commented 9 months ago

I'm interested in this, and I've already done some early experimentation to measure the performance of burn externally (i.e. using it as an external crate) and compare it with my own implementation. I have a few questions about these internal benchmarks before contributing:

How do we avoid memory allocation and device-host synchronization overhead? Or, in other words, how do we account for it in our measurements?
Should we measure CPU time, GPU time (for example, through timestamp writes), or both?
Since reduction kernels have low operational intensity, should we measure them in isolation (i.e. in a hot loop), or should we also measure them in a more realistic scenario (i.e. with kernels that do some work before and after the reduction)?
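
One possible way to frame the first and third questions is a small hot-loop harness. The sketch below uses only the Rust standard library: the input is allocated by the caller before timing starts, a few warmup iterations run untimed, and each sample is timed on its own. The names `bench` and `Timing` are hypothetical and are not part of burn. The sketch measures wall-clock CPU time only; on a GPU backend the device synchronization would need to happen inside the timed region (marked in a comment), and GPU timestamp queries are not covered here.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Timing summary for one benchmarked closure (hypothetical helper type).
pub struct Timing {
    pub min: Duration,
    pub median: Duration,
    pub max: Duration,
}

/// Run `op` for `warmup` untimed iterations, then time `samples` iterations
/// individually. Inputs should be allocated by the caller before calling
/// this function so that allocation cost stays out of the measurement.
pub fn bench<T>(warmup: usize, samples: usize, mut op: impl FnMut() -> T) -> Timing {
    assert!(samples > 0, "at least one sample is required");

    // Warmup: lets caches, lazy initialization, and (on a GPU) pipeline
    // compilation settle before anything is measured.
    for _ in 0..warmup {
        black_box(op());
    }

    let mut durations = Vec::with_capacity(samples);
    for _ in 0..samples {
        let start = Instant::now();
        // `black_box` keeps the compiler from optimizing the result away.
        black_box(op());
        // Assumption: with an asynchronous GPU backend, a device
        // synchronization call would go here, before reading the clock.
        durations.push(start.elapsed());
    }

    durations.sort();
    Timing {
        min: durations[0],
        median: durations[samples / 2],
        max: durations[samples - 1],
    }
}
```

A call such as `bench(3, 10, || input.iter().sum::<f32>())` would time a reduction in isolation. Measuring the more realistic scenario would mean passing a closure that also runs the surrounding work, then comparing the two results.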

nathanielsimard commented 9 months ago

There are some benchmarks in burn-wgpu/benches/reduction.rs, but we could put them in backend-comparison instead. We are missing benchmarks for global reduction such as mean and sum.
