numba / numba

NumPy aware dynamic Python compiler using LLVM
https://numba.pydata.org/

Type inference leads to libNVVM bugs on 0.59.1 #9514

Open · c200chromebook opened this issue 7 months ago

c200chromebook commented 7 months ago

Hi there - please see the reproducer below. Basically, and somewhat oddly, if you define the argument types explicitly when using lineinfo with record types, things work fine. If you let numba infer the types, compilation fails with "LLVM ERROR: Symbol name with unsupported characters." The only apparent difference between the types going into the underlying compiler machinery is the aligned flag; it is not clear why that would cause this. Upgrading the CUDA toolkit version does not seem to make a difference.

import numba as nb
import numpy as np
from numba import cuda

inforce = np.dtype([('I16Scales', [('EffRiderTypeId', '<i2')]),
                    ('WBFlag', '<i4'), ('PerTsData', [('EventFlags', '<u8', (421,))])], align=True)

state = np.dtype([('float64_vec', [('nGRBW', '<f8', (32,)), ('nRYTDGALB', '<f8', (32,)), ])], align=True)

# @cuda.jit(nb.void(nb.from_dtype(inforce)[:], nb.from_dtype(state)[:]), lineinfo=True) # works
@cuda.jit(lineinfo=True)  # fails
def time_steps_kernel(inforce_buffer, scen_buffer):
    sh_flags = inforce_buffer[5]['I16Scales']['EffRiderTypeId']
    if sh_flags & 1:
        scen_buffer[3]['float64_vec']['nRYTDGALB'][3] = 7
    scen_buffer[3]['float64_vec']['nRYTDGALB'][3] = 0.0
    sh_flags = inforce_buffer[5]['PerTsData']['EventFlags'][5]
    if inforce_buffer[5]['I16Scales']['EffRiderTypeId']:
        if inforce_buffer[5]['WBFlag']:
            scen_buffer[3]['float64_vec']['nGRBW'][3] = 7
        if sh_flags & 1:
            scen_buffer[3]['float64_vec']['nRYTDGALB'][3] = 2
        scen_buffer[3]['float64_vec']['nRYTDGALB'][3] = 1

# pylint:disable=unsubscriptable-object,invalid-name
def launch_time_steps_kernel(kern):
    """launch the kernel
    """
    ts_args = [np.zeros((3,), dtype=inforce), np.zeros((3,), dtype=state)]
    kern[1, 1](*ts_args)

launch_time_steps_kernel(time_steps_kernel)
print("Done")

Reporting a bug

guilhermeleobas commented 7 months ago

Thanks for the report. I can reproduce the error:

$ python repro.py
/home/guilhermeleobas/git/numba/numba/cuda/dispatcher.py:536: NumbaPerformanceWarning: Grid size 1 will likely result in GPU under-utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
LLVM ERROR: Symbol name with unsupported characters

gmarkall commented 7 months ago

I can reproduce this too - I need to look a bit deeper into what's going on in NVVM here.

dlee992 commented 4 months ago

I also tested this case locally. If I use plain @cuda.jit, the test passes, so the issue comes from the lineinfo=True option. It is also related to the complex branch structure: if I delete one or two of the branches, the test passes even with lineinfo=True.

The final error message is quite low-level: LLVM ERROR: Symbol name with unsupported characters. How do we normally debug this kind of error for CUDA? I guess we need to focus on the lowering pass?
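
One low-tech way to chase a symbol-name error like this (a sketch, assuming the eagerly compiled explicit-signature variant from the original report, which does compile) is to pull the LLVM IR that numba hands to NVVM out of the dispatcher and scan it for symbol names NVVM might object to:

import re

# inspect_llvm() on a CUDA dispatcher maps each compiled signature to its
# (unoptimized) LLVM IR string.
for sig, ir in time_steps_kernel.inspect_llvm().items():
    # Rough heuristic: quoted symbol names containing characters outside
    # the usual [A-Za-z0-9_.$] set are candidates for NVVM's complaint.
    for name in set(re.findall(r'@"([^"]+)"', ir)):
        if re.search(r"[^\w.$]", name):
            print(sig, "suspicious symbol:", name)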

dlee992 commented 4 months ago

BTW, when I want to see the optimized NVVM IR, which envvar should I use? I tried:

# os.environ["NUMBA_CUDA_DEBUGINFO"] = "1"
# os.environ["NUMBA_DEBUG_TYPEINFER"] = "1"
os.environ["NUMBA_DUMP_LLVM"] = "1"

# os.environ["NUMBA_DUMP_FUNC_OPT"] = "1"
# os.environ["NUMBA_DUMP_OPTIMIZED"] = "1"

Only NUMBA_DUMP_LLVM prints any IR, but it is verbose, as expected. How can I print the optimized NVVM IR? Or is the optimized NVVM IR even printable?
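
As far as I can tell the optimized NVVM IR itself is not exposed, but the closest post-NVVM artifact that is easy to print is the PTX. A sketch using existing numba facilities (NUMBA_DUMP_ASSEMBLY and Dispatcher.inspect_asm); whether PTX is close enough to what you want is another question:

import os

# Set before importing numba: dump the generated assembly during compilation
# (for the CUDA target this should be the PTX emitted by NVVM).
os.environ["NUMBA_DUMP_ASSEMBLY"] = "1"

# Or, per kernel, once compilation has succeeded:
for sig, ptx in time_steps_kernel.inspect_asm().items():
    print(sig)
    print(ptx)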

dlee992 commented 4 months ago

So I took a quick look at https://github.com/numba/numba/blob/df07de114404225e64eea3c0622d3aee4a12e0c8/numba/cuda/codegen.py#L138-L150. I think llvm_strs is the unoptimized LLVM IR from the numba frontend? The CUDA codegen then converts it directly to PTX, so in this context we never have an optimized NVVM IR artifact?

Update: it looks like LTO-IR is close to what I want. The code snippet above is in get_asm_str (asm being an alias of PTX); perhaps I should look into get_ltoir instead. Then perhaps we should use config.DUMP_OPTIMIZED in get_ltoir to enable the dump.

But LTO-IR and PTX are mutually exclusive, and PTX is the default choice.

Link-time optimized IR is probably not the same concept as what DUMP_OPTIMIZED dumps, so I guess the CUDA target doesn't provide a way to show the optimized IR the way the CPU target does.

gmarkall commented 4 months ago

The optimized NVVM IR remains internal to NVVM; you can only see the unoptimized LLVM IR. You can get LTO-IR, but it comes out in a proprietary format that you can't view or disassemble.