taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

[Perf] Improve sparse computation performance #1263

Open yuanming-hu opened 4 years ago

yuanming-hu commented 4 years ago

We haven't done systematic performance engineering since we switched to the LLVM backend. I suspect there is a lot of room for performance optimization, especially when it comes to sparsity and hierarchical data structures. Note that we have made pretty big changes to the IR/optimization system without performance regression tests, so sparse computation performance has probably degraded over time (e.g. #1182).

Steps

  1. Inspect sparse computation performance.
    • ...
  2. Enable performance regression tests on an MIT server.
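
For step 2, a minimal sketch of what a wall-clock regression check could look like; the saxpy kernel, the baselines.json file, and the 1.2x tolerance are all hypothetical placeholders, and a real harness would presumably run per-commit on the server:

import json
import time

import taichi as ti

ti.init(arch=ti.cuda)

x = ti.var(ti.f32, shape=1024 * 1024)

@ti.kernel
def saxpy(a: ti.f32):
    for i in x:
        x[i] = a * x[i] + 1.0

def bench_ms(kernel, *args, repeats=10):
    kernel(*args)  # warm-up launch; also triggers JIT compilation
    ti.sync()
    t0 = time.perf_counter()
    for _ in range(repeats):
        kernel(*args)
    ti.sync()  # wait for all device work before stopping the clock
    return (time.perf_counter() - t0) / repeats * 1e3  # ms per launch

ms = bench_ms(saxpy, 2.0)
baseline_ms = json.load(open('baselines.json'))['saxpy']  # hypothetical baseline file
assert ms < 1.2 * baseline_ms, f'perf regression: {ms:.3f} ms vs baseline {baseline_ms:.3f} ms'
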
yuanming-hu commented 4 years ago

One performance issue I found:

import taichi as ti

ti.init(arch=ti.cuda, print_ir=True, kernel_profiler=True)

a = ti.var(ti.i32)
b = ti.var(ti.i32)

# Sparse layout: 128K pointer cells, each pointing to dense blocks of 1024 i32s;
# a and b share the same pointer level but sit in separate dense blocks.
block = ti.root.pointer(ti.i, 1024 * 128)
block.dense(ti.i, 1024).place(a)
block.dense(ti.i, 1024).place(b)

@ti.kernel
def activate():
    # Touch every element so that all pointer cells become active.
    for i in range(1024 * 128 * 1024):
        a[i] = i

@ti.kernel
def copy():
    # Struct-for over a: visits only active elements.
    for i in a:
        b[i] = a[i]

activate()

for i in range(10):
    copy()

ti.kernel_profiler_print()

Currently, copy takes 11 ms, with the IR being

kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = [S1pointer][pointer]::lookup($9, $10) activate = true
    <gen*x1> $18 = get child [S1pointer->S4dense] $17
    <gen*x1> $19 = [S4dense][dense]::lookup($18, $13) activate = false
    <i32*x1> $20 = get child [S4dense->S5place_i32] $19
    <i32 x1> $21 : global store [$20 <- $16]
  }
}

Note that $17 performs a second lookup into S1pointer with activate = true, even though $11 has already located the same cell without activation.
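
For contrast, a hedged sketch of a layout where the activation would be genuinely needed: if b lived under its own pointer level, iterating over the active blocks of a would say nothing about b's blocks, so the store would have to activate them.

# Hypothetical variant: a and b under separate pointer levels, so writing
# b[i] must activate b's pointer cell and the lookup cannot be elided.
a = ti.var(ti.i32)
b = ti.var(ti.i32)
ti.root.pointer(ti.i, 1024 * 128).dense(ti.i, 1024).place(a)
ti.root.pointer(ti.i, 1024 * 128).dense(ti.i, 1024).place(b)

In the repro above, by contrast, a and b share the pointer level and the struct-for only visits cells that are already active, so the activation at $17 is redundant.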

After #1270, the store reuses $11's non-activating lookup instead of issuing a second, activating one; copy now takes 6.8 ms (1.6x faster), and the IR becomes

kernel {
  $0 = offloaded clear_list S1pointer
  $1 = offloaded listgen S0root->S1pointer
  $2 = offloaded clear_list S2dense
  $3 = offloaded listgen S1pointer->S2dense
  $4 = offloaded struct_for(S2dense) block_dim=0 {
    <i32 x1> $5 = loop $4 index 0
    <gen*x1> $6 = get root
    <i32 x1> $7 = const [0]
    <gen*x1> $8 = [S0root][root]::lookup($6, $7) activate = false
    <gen*x1> $9 = get child [S0root->S1pointer] $8
    <i32 x1> $10 = bit_extract($5 + 0) bit_range=[10, 27)
    <gen*x1> $11 = [S1pointer][pointer]::lookup($9, $10) activate = false
    <gen*x1> $12 = get child [S1pointer->S2dense] $11
    <i32 x1> $13 = bit_extract($5 + 0) bit_range=[0, 10)
    <gen*x1> $14 = [S2dense][dense]::lookup($12, $13) activate = false
    <i32*x1> $15 = get child [S2dense->S3place_i32] $14
    <i32 x1> $16 = global load $15
    <gen*x1> $17 = get child [S1pointer->S4dense] $11
    <gen*x1> $18 = [S4dense][dense]::lookup($17, $13) activate = false
    <i32*x1> $19 = get child [S4dense->S5place_i32] $18
    <i32 x1> $20 : global store [$19 <- $16]
  }
}