One performance issue I found:
import taichi as ti

ti.init(arch=ti.cuda, print_ir=True, kernel_profiler=True)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a)
block.dense(ti.i, 1024).place(b)

@ti.kernel
def activate():
    for i in range(1024 * 128 * 1024):
        a[i] = i

@ti.kernel
def copy():
    for i in a:
        b[i] = a[i]

activate()

for i in range(10):
    copy()

ti.kernel_profiler_print()
Currently, the copy kernel takes 11 ms, with the following IR:
kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = [S1pointer][pointer]::lookup($9, $10) activate = true
    <gen*x1> $18 = get child [S1pointer->S4dense] $17
    <gen*x1> $19 = [S4dense][dense]::lookup($18, $13) activate = false
    <i32*x1> $20 = get child [S4dense->S5place_i32] $19
    <i32 x1> $21 : global store [$20 <- $16]
  }
}
Note that $17 performs an extra, redundant activation of the S1pointer block.
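For context, the activation comes from Taichi's write semantics: a store to a field under a pointer SNode implicitly activates the enclosing block, so the store to b lowers to an activating lookup even though this struct-for only visits blocks that are already active. A minimal standalone sketch of that behavior (the field and block names here are just for illustration, not from the benchmark above):

import taichi as ti

ti.init(arch=ti.cpu)

x = ti.var(ti.i32)
blk = ti.root.pointer(ti.i, 8)
blk.dense(ti.i, 4).place(x)

@ti.kernel
def touch():
    # The store below activates blk[2] (the block containing index 9) as a
    # side effect; that is the "activate = true" lookup seen in the IR above.
    x[9] = 1

touch()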
After #1270, it takes 6.8 ms (1.6x faster) and the IR becomes
kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = get child [S1pointer->S4dense] $11
    <gen*x1> $18 = [S4dense][dense]::lookup($17, $13) activate = false
    <i32*x1> $19 = get child [S4dense->S5place_i32] $18
    <i32 x1> $20 : global store [$19 <- $16]
  }
}
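As an aside, when the two fields can share a block, placing them in the same dense SNode makes b's access chain identical to a's up to the final place node, which should let the compiler share the pointer/dense lookups between the load and the store (whether it actually does in the current optimizer is exactly the kind of thing worth measuring). A sketch of that alternative layout; note it also changes the memory layout, so it is not a drop-in replacement for the benchmark above:

import taichi as ti

ti.init(arch=ti.cuda, kernel_profiler=True)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

# Both fields live in one dense block, so b[i] sits in the block that the
# struct-for over a is already visiting.
block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a, b)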
We haven't done systematic performance engineering since we switched to the LLVM backend. I suspect there is a lot of room for performance optimization, especially when it comes to sparsity and hierarchical data structures. Note that we have made pretty big changes to the IR/Opt system without performance regression tests, so sparse computation performance has probably degraded over time (e.g., #1182).
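A minimal sketch of what a regression check for this particular kernel could look like, timed with a wall clock around ti.sync(); the 10 ms-per-call budget below is a made-up threshold, not a project standard:

import time

import taichi as ti

ti.init(arch=ti.cuda)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a)
block.dense(ti.i, 1024).place(b)

@ti.kernel
def activate():
    for i in range(1024 * 128 * 1024):
        a[i] = i

@ti.kernel
def copy():
    for i in a:
        b[i] = a[i]

activate()
copy()  # warm-up run so compilation is not included in the timing
ti.sync()

t0 = time.perf_counter()
for _ in range(10):
    copy()
ti.sync()  # wait for the GPU before stopping the clock
elapsed_ms = (time.perf_counter() - t0) / 10 * 1000

# Hypothetical budget: fail if an average copy() call exceeds 10 ms.
assert elapsed_ms < 10.0, f"copy regressed: {elapsed_ms:.2f} ms per call"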
Steps