taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Why is it so much slower with the GPU turned on than without? #8589

Open xzlinux opened 2 months ago

xzlinux commented 2 months ago

import taichi as ti

import numpy as np

ti.init(arch=ti.gpu)

benchmark = True
N = 15000

if benchmark:
    a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
    b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
else:
    a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)
    b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a, len_b = a.shape[0], b.shape[0]
    ti.loop_config(serialize=True)
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]

print(compute_lcs(a_numpy,b_numpy))

[screenshot: timing output with the GPU enabled]

The following is with the GPU not turned on (CPU): [screenshot: timing output without the GPU]

oliver-batchelor commented 2 weeks ago

Part of the problem seems to be that you're trying to run a serial algorithm on the GPU: your outer loop has ti.loop_config(serialize=True). For it to be fast on the GPU, the outer loop needs to be the parallel one.
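For illustration, here is a minimal sketch of one standard way to expose parallelism in this DP: sweep the table one anti-diagonal at a time (a wavefront), since all cells with i + j == d only read diagonals d-1 and d-2 and are therefore independent of each other. The kernel name, the per-diagonal driver loop, and the use of fields for the inputs are my assumptions for the sketch, not something from this thread:

```python
import numpy as np
import taichi as ti

ti.init(arch=ti.gpu)

N = 15000
f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))
# Keep the inputs in fields so they are uploaded to the device once,
# not copied on every kernel launch (illustrative choice).
a_f = ti.field(dtype=ti.i32, shape=N)
b_f = ti.field(dtype=ti.i32, shape=N)

@ti.kernel
def lcs_diagonal(d: ti.i32):
    # Cells on the anti-diagonal i + j == d are mutually independent,
    # so this outermost range-for is safe for Taichi to parallelize.
    for i in range(ti.max(1, d - N), ti.min(N, d - 1) + 1):
        j = d - i
        f[i, j] = ti.max(f[i - 1, j - 1] + (a_f[i - 1] == b_f[j - 1]),
                         ti.max(f[i - 1, j], f[i, j - 1]))

a_f.from_numpy(np.random.randint(0, 100, N, dtype=np.int32))
b_f.from_numpy(np.random.randint(0, 100, N, dtype=np.int32))
for d in range(2, 2 * N + 1):  # one launch per anti-diagonal
    lcs_diagonal(d)
print(f[N, N])
</code>
```

Note the trade-off: this issues roughly 2N kernel launches, and the launch overhead can still dominate unless the diagonals are long, so the GPU only pays off for large N.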

Another thing that can help (though I don't think it's relevant here) is to use data that already resides on the GPU to avoid transfers; e.g. a torch tensor on the device avoids copying.
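As a sketch of that suggestion, assuming a CUDA build of PyTorch and ti.init(arch=ti.cuda) (the kernel itself is just an illustrative placeholder): Taichi kernels accept torch tensors as ti.types.ndarray() arguments, and a tensor that already lives on the same device is used in place rather than copied host-to-device on each call.

```python
import torch
import taichi as ti

ti.init(arch=ti.cuda)

@ti.kernel
def double_in_place(x: ti.types.ndarray()):
    for i in range(x.shape[0]):  # outermost loop: parallelized on the GPU
        x[i] *= 2

# The tensor already lives on the GPU, so the kernel operates on it directly
# instead of copying a NumPy array to the device on every launch.
a = torch.randint(0, 100, (15000,), dtype=torch.int32, device="cuda")
double_in_place(a)
ti.sync()  # wait for the kernel to finish before reading the result
print(a[:5])
```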