taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

The performance of SparseMatrixBuilder on the CUDA arch is significantly lower than on the CPU arch #7080

Closed tmxklzp closed 1 year ago

tmxklzp commented 1 year ago

Describe the bug

The build method of SparseMatrixBuilder takes a long time on the CUDA arch, but performs fine on the CPU arch.

To Reproduce

import time
import taichi as ti
arch = ti.cuda # or ti.cpu
ti.init(arch=arch)

M = 10000
N_RANGE = 256
N = 9
b = ti.ndarray(ti.f32, shape=N_RANGE)
K = ti.linalg.SparseMatrixBuilder(M, N_RANGE, max_num_triplets=M*N)

@ti.kernel
def fill(A: ti.types.sparse_matrix_builder(), b: ti.types.ndarray()):
    for i in range(M):
        for j in range(N):
            ind = ti.cast(ti.random(float) * N_RANGE, ti.i32)
            A[i, ind] += ti.cast(ti.random(float) * 10, ti.i32)

    for i in range(N_RANGE):
        b[i] += ti.random(float) * 10

time1 = time.time()
fill(K, b)
time2 = time.time()
print('fill time', time2-time1)
A = K.build()
# sm = K.ptr.build_cuda()
time3 = time.time()
# print('after build sm', time3-time2)
print('build time', time3-time2)
B = A.transpose()@A
time4 = time.time()
print('sparse matrix multiply time', time4-time3)
solver = ti.linalg.SparseSolver(solver_type="LLT")
solver.compute(B)
time5 = time.time()
print('solver compute time', time5-time4)
x = solver.solve(b)
time6 = time.time()
print('solve time', time6-time5)

Log/Screenshots

The full log of the program:

$ python my_sample_code_on_cuda.py
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.9.15
[Taichi] Starting on arch=cuda
fill time 0.0567018985748291
build time 38.52553367614746
sparse matrix multiply time 0.002500295639038086
solver compute time 0.42672252655029297
solve time 0.00022101402282714844

If I change arch = ti.cuda to arch = ti.cpu:

arch = ti.cpu
$ python my_sample_code_on_cpu.py
[Taichi] version 1.3.0, llvm 15.0.4, commit 0f25b95e, linux, python 3.9.15
[Taichi] Starting on arch=x64
fill time 0.056569576263427734
build time 0.0018258094787597656
sparse matrix multiply time 0.005051374435424805
solver compute time 0.004403829574584961
solve time 0.00012135505676269531

The build time on CUDA is significantly longer, and the solver compute time is also longer than on CPU:

build time 38.52553367614746 # cuda
build time 0.0018258094787597656 # cpu
solver compute time 0.42672252655029297 # cuda
solver compute time 0.004403829574584961 # cpu

FantasyVR commented 1 year ago

Hi @tmxklzp, which Taichi version are you using? The build time was shortened in this PR. Besides, the build time decreases significantly when you run the script a second time because of the offline cache.
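
For reference, a minimal sketch of enabling the offline cache explicitly at init time; the cache directory below is just an example path, and offline_cache_file_path (if your Taichi version exposes it) only controls where cached artifacts are stored:

import taichi as ti

# Enable the offline cache so compiled kernels can be reused across runs.
# The cache path is an example; adjust it or omit the argument entirely.
ti.init(
    arch=ti.cuda,
    offline_cache=True,
    offline_cache_file_path="/tmp/taichi_cache",
)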

tmxklzp commented 1 year ago

@FantasyVR Yes, I solved this by building Taichi from source at exactly the commit of the PR you mentioned.

First, I found that the released 1.3.0 version on PyPI (commit tag: rc-v1.3.0, commit id: 0f25b95e) does not contain the commit of the PR (commit id: 8413bc2):

$ git merge-base 8413bc2 --is-ancestor 0f25b95e && echo yes || echo no
no

So I tried building from source on the master branch, which contains the PR. But that build of Taichi could not run the script; the log shows:

[E 01/11/23 17:45:25.542 8382] [dynamic_loader.cpp:load_function@30] Cannot load function: /usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so: undefined symbol: cusparseSsctr

I think there may be another bug here, but I didn't dig into it...
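
(For anyone debugging the same load failure: a quick diagnostic sketch, not a fix, to check whether the installed libcusparse actually exports that symbol. The library path is taken from the error message above; adjust it for your CUDA installation. The symbol may simply be absent from newer CUDA toolkits.)

import ctypes

# Load the library that the error message points at.
lib = ctypes.CDLL("/usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so")
try:
    _ = lib.cusparseSsctr  # ctypes raises AttributeError if the symbol is not exported
    print("cusparseSsctr is present")
except AttributeError:
    print("cusparseSsctr is missing from this libcusparse")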

Then I checked out exactly commit 8413bc2 and built from source again; this time it ran correctly:

[Taichi] version 1.3.0, llvm 15.0.4, commit 8413bc22, linux, python 3.10.8
[Taichi] Starting on arch=cuda
fill time 0.14100861549377441
build time 0.5260381698608398
sparse matrix multiply time 0.0055048465728759766
solver compute time 1.840867519378662
solve time 0.0001671314239501953

And I have another question: does the offline cache really work? I reran the program (several times) and it takes:

[Taichi] version 1.3.0, llvm 15.0.4, commit 8413bc22, linux, python 3.10.8
[Taichi] Starting on arch=cuda
fill time 0.060143232345581055
build time 0.2373356819152832
sparse matrix multiply time 0.002096414566040039
solver compute time 0.5183939933776855
solve time 0.00011801719665527344

The comparison:

build time 0.5260381698608398           # first time
build time 0.2373356819152832           # rerun

solver compute time 1.840867519378662   # first time
solver compute time 0.5183939933776855  # rerun

offline_cache is set to True, and I noticed this description in the docs: "offline_cache: Enable/disable offline cache of the compiled kernels". But the values of the sparse matrix are set randomly on each execution, so why would the cached compiled kernels improve the build time and the solver compute time? Is it because the sparse matrix has the same shape and number of triplets?

FantasyVR commented 1 year ago

Hi @tmxklzp, if offline_cache is enabled, the Python script does not need to be recompiled; the cached compilation results are loaded directly. The time saved is compilation time, so it should not have much influence on the build time and solve time.
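
As an aside, a minimal sketch of how one might factor kernel compilation (or cache loading) and CUDA setup out of the measurements, assuming the same M, N, N_RANGE, b, and fill() as in the reproduction script; K_warmup is just a hypothetical throwaway builder used for the warm-up run:

import time
import taichi as ti

ti.init(arch=ti.cuda)

# ... define M, N, N_RANGE, b, and fill() as in the reproduction script ...

# Warm-up on a throwaway builder so compilation and context setup
# are not counted in the timed run.
K_warmup = ti.linalg.SparseMatrixBuilder(M, N_RANGE, max_num_triplets=M * N)
fill(K_warmup, b)
ti.sync()  # CUDA kernel launches are asynchronous; wait before timing

K = ti.linalg.SparseMatrixBuilder(M, N_RANGE, max_num_triplets=M * N)
t0 = time.time()
fill(K, b)
ti.sync()
print('fill time (excluding compilation)', time.time() - t0)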

tmxklzp commented 1 year ago

@FantasyVR Okay, I got it. Thank you for helping me!