taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Passing struct vs dereferencing fields in struct performance #4802

Open bsavery opened 2 years ago

bsavery commented 2 years ago

In short, I see a significant performance decrease when passing a struct to a function versus dereferencing the struct's fields and passing the values.

Here I show two ways to intersect a ray with a set of spheres: one passes the sphere struct itself, the other pulls the values out of the sphere struct and passes those.

import taichi as ti
import time

ti.init(arch=ti.gpu)

n_spheres = 100000

# just a ray pointing in Z direction
ray_org = ti.Vector([0.0, 0.0, 0.0])
ray_dir = ti.Vector([0.0, 0.0, 1.0])

# random set of spheres
sphere_type = ti.types.struct(center=ti.types.vector(3, ti.f32), radius=ti.f32)
spheres = sphere_type.field(shape=(n_spheres,))
@ti.kernel
def fill_spheres():
    # generate random spheres in 0-100 xyz with radius 0-10
    for i in spheres:
        spheres[i].radius = ti.random() * 10.0
        spheres[i].center = ti.Vector([ti.random(), ti.random(), ti.random()]) * 100.0
fill_spheres()

@ti.kernel
def pass_reference() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        if intersect_sphere(sphere, ray_org, ray_dir):
            num_hit += 1
    return num_hit

@ti.kernel
def pass_decomposed() -> ti.i32:
    num_hit = 0
    for i in spheres:
        sphere = spheres[i]
        center, radius = sphere.center, sphere.radius
        if intersect_center_radius(center, radius, ray_org, ray_dir):
            num_hit += 1
    return num_hit

@ti.func
def intersect_sphere(sphere, ray_origin, ray_direction):
    # return if the ray hits the sphere
    oc = ray_origin - sphere.center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - sphere.radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0

@ti.func
def intersect_center_radius(center, radius, ray_origin, ray_direction):
    # return if the ray hits the decomposed sphere
    oc = ray_origin - center
    a = ray_direction.norm_sqr()
    half_b = oc.dot(ray_direction)
    c = (oc.norm_sqr() - radius**2)
    discriminant = (half_b**2) - a * c

    return discriminant >= 0.0

t = time.time()
pass_reference()
print('Passing reference', time.time() - t)

t = time.time()
pass_decomposed()
print('Passing decomposed', time.time() - t)

Also note that the backend (Metal vs. Vulkan vs. CPU) affects the timings quite a bit:

Metal
Passing reference 0.23202180862426758
Passing decomposed 0.09195494651794434

Vulkan
Passing reference 0.07849979400634766
Passing decomposed 0.05303597450256348

CPU
Passing reference 0.08902120590209961
Passing decomposed 0.06803393363952637

Originally posted by @bsavery in https://github.com/taichi-dev/taichi/discussions/4784#discussioncomment-2576249

FantasyVR commented 2 years ago

cc @qiao-bo @turbo0628

k-ye commented 2 years ago

As a good first step, we can enable ti.init(..., print_ir=True) to see whether there's a significant difference in the CHI IR. See https://github.com/taichi-dev/taichi/discussions/4784#discussioncomment-2570901
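
For reference, a minimal sketch of what that looks like with the script above (only the init call changes; everything else stays the same):

import taichi as ti

# Print the CHI IR of every kernel as it is compiled, so the IR of
# pass_reference and pass_decomposed can be compared side by side.
ti.init(arch=ti.gpu, print_ir=True)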

k-ye commented 2 years ago

Note that you might also want to exclude the timing for the first run, because that counts JIT time as well.
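
A sketch of the adjusted timing, assuming the same script as above; the warm-up calls are the only addition:

# Warm up both kernels once so that JIT compilation happens outside
# the timed region.
pass_reference()
pass_decomposed()

t = time.time()
pass_reference()
print('Passing reference', time.time() - t)

t = time.time()
pass_decomposed()
print('Passing decomposed', time.time() - t)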

bsavery commented 2 years ago

@k-ye thanks for that.

So with a simple change to exclude the JIT time (running each kernel once before timing) I get:

Vulkan
Passing reference 0.02786087989807129
Passing decomposed 0.047796010971069336 (worse!)

CPU
Passing reference 0.01233816146850586
Passing decomposed 0.011642217636108398

Metal
Passing reference 0.0014309883117675781
Passing decomposed 0.0014178752899169922

So different code is clearly being generated for the two kernels, and the earlier timings mostly reflected the difference in compilation/JIT time between them (caused by that different generated code).

Side question: Is there a way to not recompile kernels each test run? I.e., if I run python test.py and the kernel code doesn't change, it would be nice not to recompile the next time I call python test.py. I'm not sure how you could implement this, maybe with timestamps or by checking whether the IR has changed, but you could imagine it being useful.

k-ye commented 2 years ago

Is there a way to not recompile kernels each test run?

Yup, @PGZXB is working on an offline cache system (https://github.com/taichi-dev/taichi/issues/4401), starting with the LLVM backend. We are currently using the AST as the cache key. @PGZXB only has about one day per week to work on Taichi, so the feature is a bit slow to land, but we are moving in that direction. Thanks for your suggestion :-)
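
For what it's worth, once the feature lands, usage would presumably look roughly like the sketch below; the offline_cache flag and the cache-path argument are assumptions on my part and may differ from what actually ships:

import taichi as ti

# Hypothetical usage of the offline cache from #4401: compiled kernels
# are cached on disk, so an unchanged kernel is not re-JIT-compiled on
# the next `python test.py` run. Flag names are assumptions.
ti.init(arch=ti.gpu, offline_cache=True,
        offline_cache_file_path='./.taichi_cache')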

bobcao3 commented 2 years ago

That's quite fascinating. I'd imagine the generated CHI-IR should be quite similar, and it's weird that only Vulkan shows a big regression. (I'm also assuming this is on a Mac, where SPIR-V is actually translated to MSL by MoltenVK...)

We should check three things:

  1. Check for consistency and reproducibility on other devices (i.e. whether this is a problem that interacts with a specific environment, or purely a codegen issue)
  2. Compare the CHI-IR differences (see the sketch below)
  3. Compare the SPIR-V differences

Something else that can be quite helpful is to run the two SPIR-V binaries through Radeon Graphics Analyzer as well, to get the raw assembly for instruction-count and cycle-latency readings.
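
For item 2, a low-tech sketch (assuming the benchmark is saved as test.py, per the thread, and calls ti.init(print_ir=True)): capture the IR dump to a file once per run and compare the two kernels' sections with an ordinary text diff. The SPIR-V comparison and the RGA pass would still need backend-specific tooling.

import subprocess

# Capture the CHI IR dump emitted by ti.init(print_ir=True) in test.py
# (script name assumed from the thread) so the IR of pass_reference and
# pass_decomposed can be compared with a plain text diff tool.
with open('ir_dump.txt', 'w') as f:
    subprocess.run(['python', 'test.py'], stdout=f, stderr=subprocess.STDOUT)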