Open ailzhang opened 2 years ago
I have reproduced the performance gap: compute latency is around 13 ms on CUDA but 43 ms on Vulkan, roughly 3x slower. The major bottleneck is that the Vulkan backend cannot optimize global array accesses as well as CUDA does. So if we replace some global array accesses with scalar computation, we can easily increase Vulkan's frame rate. The code below illustrates using a local acc_val to replace repeated acc[i] accesses. You can also apply the same scalarization technique to den[i]. I observed an encouraging 23 fps -> 45 fps bump for the Vulkan backend on my RTX 3080 GPU, but no change for CUDA (still 13 ms and 60 fps).
@ti.kernel
def update_density(pos: ti.any_arr(field_dim=1), den: ti.any_arr(field_dim=1), pre: ti.any_arr(field_dim=1)):
    for i in range(particle_num):
        density = 0.0
        for j in range(particle_num):
            R = pos[i] - pos[j]
            density += mass * W(R, h)
        pre[i] = pressure_scale * max(pow(density / rest_density, gamma) - 1, 0)
        den[i] = density


@ti.kernel
def update_force(
    pos: ti.any_arr(field_dim=1),
    vel: ti.any_arr(field_dim=1),
    den: ti.any_arr(field_dim=1),
    pre: ti.any_arr(field_dim=1),
    acc: ti.any_arr(field_dim=1),
    gravity: ti.any_arr(field_dim=0),
):
    for i in range(particle_num):
        # Accumulate into a local instead of writing acc[i] inside the loop.
        acc_val = gravity[None]
        for j in range(particle_num):
            R = pos[i] - pos[j]
            # Pressure force
            acc_val += (
                -mass
                * (pre[i] / (den[i] * den[i]) + pre[j] / (den[j] * den[j]))
                * W_gradient(R, h)
            )
            # Viscosity force
            acc_val += (
                viscosity_scale
                * mass
                * (vel[i] - vel[j]).dot(R)
                / (R.norm() + 0.01 * h * h)
                / den[j]
                * W_gradient(R, h)
            )
            # Surface tension
            R2 = R.dot(R)
            D2 = particle_diameter * particle_diameter
            if R2 > D2:
                acc_val += -tension_scale * R * W(R, h)
            else:
                acc_val += (
                    -tension_scale
                    * R
                    * W(ti.Vector([0.0, 1.0, 0.0]) * particle_diameter, h)
                )
        acc[i] = acc_val
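To make the suggested transform for den[i] concrete, here is a plain-Python sketch (hypothetical names, no Taichi dependency) of the pressure term from update_force: den[i] and pre[i] are invariant with respect to the inner j loop, so they can be hoisted into locals and loaded once per particle.

```python
def pressure_term_naive(pre, den):
    n = len(pre)
    acc = [0.0] * n
    for i in range(n):
        s = 0.0
        for j in range(n):
            # den[i] and pre[i] are re-loaded on every j iteration
            s += -(pre[i] / (den[i] * den[i]) + pre[j] / (den[j] * den[j]))
        acc[i] = s
    return acc


def pressure_term_scalarized(pre, den):
    n = len(pre)
    acc = [0.0] * n
    for i in range(n):
        pre_i = pre[i]            # hoisted: one load per particle
        den_i2 = den[i] * den[i]  # hoisted: one load and one multiply per particle
        s = 0.0
        for j in range(n):
            s += -(pre_i / den_i2 + pre[j] / (den[j] * den[j]))
        acc[i] = s
    return acc
```

Both versions perform the same arithmetic in the same order, so the results match exactly; the difference is only how many array loads the inner loop issues, which is exactly what the backend fails to optimize for ndarray arguments.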
By the way, the compute kernels for SPH are quite similar to those of N-Body, so I suspect there is a rather subtle cause behind this performance gap. I'll build a simpler benchmark kernel to see if I can reproduce the problem.
This seems like a case of store-to-load forwarding not working properly within CHI-IR.
It's not working: the args are ndarrays, so they become ExternalPtrStmt in CHI-IR, and the pass ignores ExternalPtrStmt (see here).
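For context, a toy sketch of what such a forwarding pass does (hypothetical IR tuples, not the actual CHI-IR data structures): a store records the value known to live at an address, and a later load from the same address within the block is replaced by a copy of that value, eliminating the redundant load. The real pass does this for field accesses but, per the code linked above, skips ExternalPtrStmt.

```python
def forward_loads(block):
    """Toy store-to-load forwarding over a single basic block.

    Instructions are ("store", addr, src) or ("load", dest, addr).
    """
    known = {}  # addr -> value most recently stored there
    out = []
    for inst in block:
        if inst[0] == "store":
            _, addr, val = inst
            known[addr] = val
            out.append(inst)
        elif inst[0] == "load":
            _, dest, addr = inst
            if addr in known:
                # Redundant load: replace with a copy of the stored value.
                out.append(("copy", dest, known[addr]))
            else:
                out.append(inst)
    return out


block = [
    ("store", "acc[i]", "t1"),
    ("load", "t2", "acc[i]"),  # forwarded: becomes ("copy", "t2", "t1")
    ("load", "t3", "den[j]"),  # no prior store to den[j]: kept as a load
]
print(forward_loads(block))
```

When an address is backed by an ExternalPtrStmt, the real pass conservatively treats it as unknown memory, so the second load in the pattern above survives into the generated code.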
Update: I can also observe the redundant load/store in the PTX code, but there's no performance difference between the two versions. Is it optimized away by the CUDA runtime?
This is an issue from @YuCrazing: running the following script gives 47 fps on the CUDA backend and 8.8 fps on the Vulkan backend (RTX 2060).