Part of the problem seems to be that you're trying to run a serial algorithm on the GPU: your outer loop has ti.loop_config(serialize=True). For it to be fast on the GPU, the outer loop needs to be the parallel one (see the anti-diagonal sketch after the code below).
Another thing that can help (though I don't think it's relevant here) is to use data that already resides on the GPU to avoid transfers, e.g. a torch tensor on the device can avoid copying.
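For example, a minimal sketch of that idea, assuming a CUDA build of both torch and Taichi (the kernel name and tensor shape are just illustrative, not part of the original code):

import taichi as ti
import torch

ti.init(arch=ti.cuda)

@ti.kernel
def double_in_place(x: ti.types.ndarray()):
    # x can be a torch tensor living on the CUDA device; no host<->device copy
    # is needed when Taichi also runs on CUDA.
    for i in range(x.shape[0]):
        x[i] = x[i] * 2

x = torch.randint(0, 100, (15000,), dtype=torch.int32, device="cuda")
double_in_place(x)  # the tensor stays on the device the whole time

The code in question: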
import taichi as ti
import numpy as np
ti.init(arch=ti.gpu)

benchmark = True
N = 15000

if benchmark:
    a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
    b_numpy = np.random.randint(0, 100, N, dtype=np.int32)
else:
    a_numpy = np.array([0, 1, 0, 2, 4, 3, 1, 2, 1], dtype=np.int32)
    b_numpy = np.array([4, 0, 1, 4, 5, 3, 1, 2], dtype=np.int32)

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))

@ti.kernel
def compute_lcs(a: ti.types.ndarray(), b: ti.types.ndarray()) -> ti.i32:
    len_a, len_b = a.shape[0], b.shape[0]
    ti.loop_config(serialize=True)
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                             ti.max(f[i - 1, j], f[i, j - 1]))
    return f[len_a, len_b]
print(compute_lcs(a_numpy,b_numpy))
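One possible restructuring is a rough sketch only, not a drop-in replacement (compute_diagonal and compute_lcs_parallel are names I made up for illustration): sweep the table along anti-diagonals. Every cell on a given anti-diagonal depends only on the two previous diagonals, so all cells on one diagonal can be filled in parallel, while the diagonals themselves are still processed in order.

import taichi as ti
import numpy as np

ti.init(arch=ti.gpu)

N = 15000
a_numpy = np.random.randint(0, 100, N, dtype=np.int32)
b_numpy = np.random.randint(0, 100, N, dtype=np.int32)

f = ti.field(dtype=ti.i32, shape=(N + 1, N + 1))
a = ti.field(dtype=ti.i32, shape=N)
b = ti.field(dtype=ti.i32, shape=N)
a.from_numpy(a_numpy)  # copy the inputs to the device once
b.from_numpy(b_numpy)

@ti.kernel
def compute_diagonal(d: ti.i32, len_a: ti.i32, len_b: ti.i32):
    # Cells with i + j == d form one anti-diagonal; this outermost loop is
    # the one Taichi parallelizes.
    for i in range(ti.max(1, d - len_b), ti.min(len_a, d - 1) + 1):
        j = d - i
        f[i, j] = ti.max(f[i - 1, j - 1] + (a[i - 1] == b[j - 1]),
                         ti.max(f[i - 1, j], f[i, j - 1]))

def compute_lcs_parallel(len_a: int, len_b: int) -> int:
    # Diagonals still have to be processed in order, one after another.
    for d in range(2, len_a + len_b + 1):
        compute_diagonal(d, len_a, len_b)
    return f[len_a, len_b]

print(compute_lcs_parallel(N, N))

Note that this launches one kernel per diagonal, so launch overhead can dominate for short inputs; it's only meant to illustrate making the outermost loop the parallel one.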
The following is without starting the CPU: [screenshot attachment]