Taichi slower than CUDA/OpenCL

99991 commented 6 months ago

Describe the bug

I am currently evaluating various frameworks for GPU acceleration for a project of mine and found that Taichi is slower than expected. Due to foreign function call overhead, Taichi is expected to be a little slower than native CUDA, but it should not be three times slower than CuPy with custom kernels.

To Reproduce

Here is a Taichi implementation of matrix-vector multiplication ($A x = b$). Am I missing something?

import taichi as ti
import numpy as np
import time

@ti.kernel
def matvec(A: ti.template(), x: ti.template(), b: ti.template()):
    m, n = A.shape
    ti.loop_config(block_dim=8)
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += A[i, j] * x[j]
        b[i] = s

@ti.kernel
def init(x: ti.template()):
    for i in ti.grouped(x):
        x[i] = ti.random(ti.float32)

def main():
    m = 512
    n = 1024

    A = ti.field(shape=(m, n), dtype=ti.float32)
    x = ti.field(shape=n, dtype=ti.float32)
    b = ti.field(shape=m, dtype=ti.float32)

    init(A)
    init(x)
    init(b)

    b_expected_np = A.to_numpy() @ x.to_numpy()

    for _ in range(100):
        ti.sync()
        start_time = time.perf_counter()

        matvec(A, x, b)

        b_np = b.to_numpy()

        ti.sync()
        elapsed_time = time.perf_counter() - start_time

        print(f"{elapsed_time * 1e6:9.3f} µs")

        assert np.allclose(b_expected_np, b_np)

if __name__ == "__main__":
    ti.init(arch=ti.cuda)
    main()

I've also got matvec implementations for CUDA, OpenCL, CuPy, CuBLAS, Numba and Taichi with other backends here for comparison.

Log/Screenshots

min_time

Additional comments

I have tried this with other Taichi versions, CUDA drivers and GPUs. The results were similar.

System Info

```python $ ti diagnose [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 ******************************************* ** Taichi Programming Language ** ******************************************* Docs: https://docs.taichi-lang.org/ GitHub: https://github.com/taichi-dev/taichi/ Forum: https://forum.taichi.graphics/ Taichi system diagnose: python: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] system: linux executable: /home/myusername/miniconda3/envs/myenv/bin/python platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35 architecture: 64bit ELF uname: uname_result(system='Linux', node='f8pc', release='6.5.0-28-generic', version='#29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2', machine='x86_64') /home/myusername/miniconda3/envs/myenv/lib/python3.11/site-packages/taichi/tools/diagnose.py:20: DeprecationWarning: 'locale.getdefaultlocale' is deprecated and slated for removal in Python 3.15. Use setlocale(), getencoding() and getlocale() instead. print(f'locale: {".".join(locale.getdefaultlocale())}') locale: en_US.UTF-8 PATH: /home/myusername/miniconda3/envs/myenv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:~/executables PYTHONPATH: ['/home/myusername/miniconda3/envs/myenv/bin', '/home/myusername/miniconda3/envs/myenv/lib/python311.zip', '/home/myusername/miniconda3/envs/myenv/lib/python3.11', '/home/myusername/miniconda3/envs/myenv/lib/python3.11/lib-dynload', '/home/myusername/miniconda3/envs/myenv/lib/python3.11/site-packages', '/media/myusername/samsung870qvo4tb/data/stable-diffusion/OneTrainer/src/diffusers/src', '/media/myusername/samsung870qvo4tb/data/stable-diffusion/OneTrainer/src/mgds/src'] No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 22.04.4 LTS Release: 22.04 Codename: jammy import: cpu: True metal: False opengl: True cuda: True vulkan: True `glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo' Thu May 9 18:46:25 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA GeForce RTX 3060 ... Off | 00000000:01:00.0 On | N/A | | N/A 44C P8 11W / 80W | 153MiB / 6144MiB | 17% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 985 G /usr/lib/xorg/Xorg 149MiB | +-----------------------------------------------------------------------------------------+ [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 [Taichi] Starting on arch=x64 [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 [Taichi] Starting on arch=opengl [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 [Taichi] Starting on arch=cuda [Taichi] version 1.7.0, llvm 15.0.4, commit 2fd24490, linux, python 3.11.7 ******************************************* ** Taichi Programming Language ** ******************************************* Docs: https://docs.taichi-lang.org/ GitHub: https://github.com/taichi-dev/taichi/ Forum: https://forum.taichi.graphics/ TAICHI EXAMPLES ──────────────────────────────────────────────────────────────────────────────────── 0: ad_gravity 25: karman_vortex_street 50: patterns 1: circle_packing_image 26: keyboard 51: pbf2d 2: comet 27: laplace 52: physarum 3: cornell_box 28: laplace_equation 53: poisson_disk_sampling 4: diff_sph 29: mandelbrot_zoom 54: print_offset 5: differential_evolution 30: marching_squares 55: rasterizer 6: euler 31: mass_spring_3d_ggui 56: regression 7: eulerfluid2d 32: mass_spring_game 57: sdf_renderer 8: explicit_activation 33: mass_spring_game_ggui 58: simple_derivative 9: export_mesh 34: mciso_advanced 59: simple_texture 10: export_ply 35: mgpcg 60: simple_uv 11: export_videos 36: mgpcg_advanced 61: snow_phaseField 12: fem128 37: minimal 62: stable_fluid 13: fem128_ggui 38: minimization 63: stable_fluid_ggui 14: fem99 39: mpm128 64: stable_fluid_graph 15: fractal 40: mpm128_ggui 65: taichi_bitmasked 16: fractal3d_ggui 41: mpm3d 66: taichi_dynamic 17: fullscreen 42: mpm3d_ggui 67: taichi_logo 18: game_of_life 43: mpm88 68: taichi_ngp 19: gui_image_io 44: mpm88_graph 69: taichi_sparse 20: gui_widgets 45: mpm99 70: texture_graph 21: implicit_fem 46: mpm_lagrangian_forces 71: tutorial 22: implicit_mass_spring 47: nbody 72: two_stream_instability 23: initial_value_problem 48: odop_solar 73: vortex_rings 24: jacobian 49: oit_renderer 74: waterwave ──────────────────────────────────────────────────────────────────────────────────── 42 Running example minimal ... [Taichi] Starting on arch=x64 42.0 >>> Running time: 0.19s Consider attaching this log when maintainers ask about system information. >>> Running time: 4.34s ```

bobcao3 commented 5 months ago

Can you show us the kernels you used for cuda_matvec / opencl_matvec ? Taichi only parallelizes the outer loop, and GPUs don't work well with block dims smaller than 32. You should target a 128 block dim and roll the both loops into one for: for i, j in ti.ndrange(...) or for i, j in A, you should see significant speedup doing it that way

99991 commented 5 months ago

Can you show us the kernels you used for cuda_matvec / opencl_matvec ?

You can find the code for the other implementations by clicking on the word "here" in my post. For your convenience, here is the link again: https://github.com/99991/matvec-gpu/tree/main?tab=readme-ov-file#results

Taichi only parallelizes the outer loop, and GPUs don't work well with block dims smaller than 32. You should target a 128 block dim

I tried multiple block dims and found that a block dim of 8 was among the fastest. Here is a graph showing block dim vs min execution time over 10000 runs, but note that there is some jitter in there, so it probably does not matter as long as the block dim is not larger than 128.

times

You should target a 128 block dim and roll the both loops into one for: for i, j in ti.ndrange(...) or for i, j in A, you should see significant speedup doing it that way

This makes the code slower, because there is is a race condition when summing up the result, which has to be resolved with atomics.

for i, j in A:
    b[i] += A[i, j] * x[j]
#   ^^^^^^^
#   race condition when multiple threads want to add to same b[i]

But this is not important, because I am not interested in a faster matvec implementation. Instead, I want to know why Taichi is slower than CuPy and OpenCL for the same implementation. This matvec implementation just serves as an example.

bobcao3 commented 5 months ago

I would argue that forcing a block dim of 8 is never a good idea, and this isn't a typical implementation of mat @ vec. We have tested a more conventional approach before.

Taichi has optimization for atomic reduction, I'd recommend you test with that.

Also judging from the numbers you posted, I would say the matrix is way too small. The GPU results here might very well be just a CPU overhead test. This might also explain why the tiny block size works well. It's only using 512 threads, which means unless you use very small block size there simply isn't enough threads to cover the entire GPU.

And yes we admit Taichi has pretty high CPU overhead.

99991 commented 5 months ago

I would argue that forcing a block dim of 8 is never a good idea, and this isn't a typical implementation of mat @ vec. We have tested a more conventional approach before. Taichi has optimization for atomic reduction, I'd recommend you test with that.

Again, the point of my issue is not to implement an efficient matvec implementation. It just serves as an example to demonstrate that Taichi is slower than CUDA.

Also judging from the numbers you posted, I would say the matrix is way too small. The GPU results here might very well be just a CPU overhead test. This might also explain why the tiny block size works well. It's only using 512 threads, which means unless you use very small block size there simply isn't enough threads to cover the entire GPU.

I increased m and n to 8192 each. Taichi is still significantly slower than CuPy and the performance is roughly the same for block dims between 4 and 128.

To make sure that CPU overhead is definitely not the issue, here is a naive matrix multiplication implementation.

Naive matrix multiplication with Taichi

~800 ms

import taichi as ti
import numpy as np
import time

@ti.kernel
def matmul(A: ti.template(), B: ti.template(), C: ti.template()):
    _, n = A.shape
    ti.loop_config(block_dim=128)
    for i, j in C:
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C[i, j] = s

@ti.kernel
def init(x: ti.template()):
    for i in ti.grouped(x):
        x[i] = ti.random(ti.float32)

def main():
    m = 4096
    k = 4096
    n = 4096

    A = ti.field(shape=(m, k), dtype=ti.float32)
    B = ti.field(shape=(k, n), dtype=ti.float32)
    C = ti.field(shape=(m, n), dtype=ti.float32)

    init(A)
    init(B)
    init(C)

    C_expected_np = A.to_numpy() @ B.to_numpy()

    min_time = float("inf")
    for _ in range(100):
        ti.sync()
        start_time = time.perf_counter()

        matmul(A, B, C)

        C_np = C.to_numpy()

        ti.sync()
        elapsed_time = time.perf_counter() - start_time

        print(f"{elapsed_time * 1e3:9.3f} ms")

        assert np.allclose(C_expected_np, C_np)

        min_time = min(min_time, elapsed_time)

    print(f"Min: {min_time * 1e3:9.3f} ms")

if __name__ == "__main__":
    ti.init(arch=ti.cuda, kernel_profiler=True)
    main()
    ti.profiler.print_kernel_profiler_info()

Naive matrix multiplication with CuPy

~300 ms

import cupy as cp
import numpy as np
import time

matmul = cp.RawKernel(
r'''
extern "C" __global__
void matmul(
    float *A,
    float *B,
    float *C,
    int m,
    int k,
    int n
){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i >= m || j >= n) return;

    float s = 0;
    for (int k2 = 0; k2 < k; k2++){
        s += A[i * k + k2] * B[k2 * n + j];
    }

    C[i * n + j] = s;
}
''', "matmul")

def main():
    m = 4096
    k = 4096
    n = 4096

    A = cp.random.rand(m, k, dtype=cp.float32)
    B = cp.random.rand(k, n, dtype=cp.float32)
    C = cp.random.rand(m, n, dtype=cp.float32)

    C_expected_np = A.get() @ B.get()

    min_time = float("inf")
    for _ in range(100):
        cp.cuda.Device().synchronize()
        start_time = time.perf_counter()

        block_size = (1, 128)
        grid_size = (ceil_div(m, block_size[0]), ceil_div(n, block_size[1]))

        matmul(grid_size, block_size, (A, B, C, m, k, n))

        C_np = C.get()

        cp.cuda.Device().synchronize()
        elapsed_time = time.perf_counter() - start_time

        print(f"{elapsed_time * 1e3:9.3f} ms")

        assert np.allclose(C_expected_np, C_np)

        min_time = min(min_time, elapsed_time)

    print(f"Min: {min_time * 1e3:9.3f} ms")

def ceil_div(a, b):
    return (a + b - 1) // b

if __name__ == "__main__":
    main()

As can be seen, Taichi is even slower here. To be fair, the difference could also be due to Taichi not tiling the matrix in a favorable way (which is why I chose the matrix-vector multiplication example in my original issue), but ti.loop_config seems to lack a two-dimensional block dim parameter.

marioroy commented 3 months ago

@99991, thank you for this thread and your GitHub repo. Using your repo, it turns out the Vulkan and OpenGL backends perform well simply commenting out the ti.loop_config(block_dim=N) line. That is the case running on a desktop RTX 3070 GPU. The frameworks Numba CUDA, Taichi Vulkan, and Taichi OpenGL perform similarly. Taichi CUDA is faster, trailing behind cuda_matvec.

I captured the Unix time computing n, m = 8192, 8192 one thousand times. The block_dim line is commented out in the Taichi demonstrations. block_size is set to 32 in cuda_matvec.cu, cupy_matvec.py, and numba_matvec.py.

for _ in range(1000):
    ...

| Framework     | Unix Time | Notes                            |
| Numba CUDA    | 2.944s    | block_size = 32                  |
| Taichi CPU    | 4.558s    | AMD Ryzen Threadripper 3970X CPU |
| Taichi OpenGL | 3.044s    | ti.loop_config commented out     |
| Taichi Vulkan | 2.980s    | ti.loop_config commented out     |
| Taichi CUDA   | 2.218s    | ti.loop_config commented out     |
| nvcc CUDA     | 1.887s    | block_size = 32                  |
| PyTorch CUDA  | 1.852s    |                                  |
| CuPy CUDA     | 1.477s    | block_size = 32                  |
| cuBLAS CUDA   | 1.211s    |                                  |

The cuda_matvec.cc was built with the -fmad=false option. No errors with assert(err < 1e-7f) or smaller.

nvcc -o cuda_matvec -O3 -fmad=false cuda_matvec.cu

I added Numba type signatures to the matvec function. Doing so speeds up JIT compilation.

@cuda.jit('void(f4[:,:], f4[:], f4[:], i4, i4)')
def matvec(A, x, b, m, n):
    ...

marioroy commented 3 months ago

Here is the CPU variant where I tried using external arrays as Taichi kernel arguments. This runs slower on the GPU but no impact running on the CPU versus ti.field.

taichi_cpu_matvec.py

import taichi as ti
import numpy as np
import time

@ti.kernel
def matvec(A: ti.types.ndarray(), x: ti.types.ndarray(), b: ti.types.ndarray()):
    m, n = A.shape
    for i in range(m):
        s = 0.0
        for j in range(n):
            s += A[i, j] * x[j]
        b[i] = s

def main():
    m = 8192
    n = 8192

    A = np.random.rand(m, n).astype(np.float32)
    x = np.random.rand(n).astype(np.float32)
    b = np.zeros(m, dtype=np.float32)

    b_expected = A @ x

    for _ in range(1000):
        start_time = time.perf_counter()
        matvec(A, x, b)
        ti.sync()

        elapsed_time = time.perf_counter() - start_time
        print(f"{elapsed_time * 1e6:9.3f} µs")
        assert np.allclose(b_expected, b)

if __name__ == "__main__":
    ti.init(arch=ti.cpu)
    main()

taichi-dev / taichi

Taichi slower than CUDA/OpenCL #8526

Naive matrix multiplication with Taichi

Naive matrix multiplication with CuPy