nv-legate / cunumeric

An Aspiring Drop-In Replacement for NumPy at Scale
https://docs.nvidia.com/cunumeric/24.06/
Apache License 2.0

Add support for compiling to SoA ABI for CUDA #1103

Open gmarkall opened 10 months ago

gmarkall commented 10 months ago

This adds a new function, compile_ptx_soa(), which works similarly to Numba's compile_ptx() function, except that the compiled code writes the elements of a tuple return type into SoA data structures passed in as pointers.

Example

The Python function:

def addsub(x, y):
    return x + y, x - y

compiled with compile_ptx_soa(), with both the argument types and the elements of the tuple return type specified as int32, compiles to PTX equivalent to the C function:

void addsub(int32_t *r1, int32_t *r2, int32_t x, int32_t y)
{
  *r1 = x + y;
  *r2 = x - y;
}

Or, as the actual PTX produced:

.visible .func addsub(
    .param .b64 addsub_param_0,
    .param .b64 addsub_param_1,
    .param .b32 addsub_param_2,
    .param .b32 addsub_param_3
)
{
    .reg .b32   %r<5>;
    .reg .b64   %rd<3>;

    ld.param.u64    %rd1, [addsub_param_0];
    ld.param.u64    %rd2, [addsub_param_1];
    ld.param.u32    %r1, [addsub_param_2];
    ld.param.u32    %r2, [addsub_param_3];
    add.s32     %r3, %r2, %r1;
    sub.s32     %r4, %r1, %r2;
    st.u32  [%rd1], %r3;
    st.u32  [%rd2], %r4;
    ret;

}
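
For reference, here is a minimal sketch of how the new function might be invoked. This is an assumption based on the PR description rather than verbatim code from this PR: it presumes compile_ptx_soa() accepts a Numba signature the way numba.cuda.compile_ptx() does, and returns the PTX source along with the resolved return type.

from numba import types
from cunumeric.numba_utils import compile_ptx_soa

def addsub(x, y):
    return x + y, x - y

# An (int32, int32) -> UniTuple(int32, 2) signature; each element of the
# returned tuple is written through one of the pointer parameters seen in
# the PTX above.
sig = types.UniTuple(types.int32, 2)(types.int32, types.int32)
ptx, resty = compile_ptx_soa(addsub, sig)
print(ptx)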

Returning a heterogeneous tuple is also possible. For example, when the return type is specified as a tuple of (int32, float32), we get:

.visible .func addsub(
    .param .b64 addsub_param_0,
    .param .b64 addsub_param_1,
    .param .b32 addsub_param_2,
    .param .b32 addsub_param_3
)
{
    .reg .f32   %f<2>;
    .reg .b32   %r<4>;
    .reg .b64   %rd<6>;

    ld.param.u64    %rd1, [addsub_param_0];
    ld.param.u64    %rd2, [addsub_param_1];
    ld.param.u32    %r1, [addsub_param_2];
    ld.param.u32    %r2, [addsub_param_3];
    add.s32     %r3, %r2, %r1;
    cvt.s64.s32     %rd3, %r1;
    cvt.s64.s32     %rd4, %r2;
    sub.s64     %rd5, %rd3, %rd4;
    cvt.rn.f32.s64  %f1, %rd5;
    st.u32  [%rd1], %r3;
    st.f32  [%rd2], %f1;
    ret;
}

(Note the st.u32 for the first return value vs. st.f32 for the second).
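
Continuing the sketch above, only the signature would change for the heterogeneous case (again an assumption, using numba.types.Tuple as in regular Numba):

from numba import types

# (int32, int32) arguments returning an (int32, float32) tuple.
sig = types.Tuple((types.int32, types.float32))(types.int32, types.int32)
ptx, resty = compile_ptx_soa(addsub, sig)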

copy-pr-bot[bot] commented 10 months ago

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

magnatelee commented 10 months ago

/ok to test

gmarkall commented 9 months ago

@manopapad Many thanks for having a look over this - I'll resolve the issues you noted shortly.

It looks like tests also failed because Numba is not a dependency of cuNumeric, but it will need to be (at least for numba_utils to work). I have looked through the various actions scripts and bits and pieces, but I can't see what the correct way to ensure Numba is installed is - could you suggest how I can make sure Numba gets installed as required, please?

manopapad commented 9 months ago

> could you suggest how I can make sure Numba gets installed as required, please?

Can you try adding numba around here? https://github.com/nv-legate/cunumeric/blob/branch-24.01/conda/conda-build/meta.yaml#L148

gmarkall commented 9 months ago

Thanks, just giving that a try now.

gmarkall commented 9 months ago

/ok to test

RAMitchell commented 9 months ago

What are the python dependencies being introduced here? Just a note that we may want to make some features optional if they significantly increase packaging complexity.

gmarkall commented 9 months ago

The dependency being added is Numba, which also depends on llvmlite. Both are available as wheels on PyPI and on conda-forge for Linux (x86_64, ppc64le, AArch64), macOS (x86_64, arm64), and Windows (x86_64).

In general I think projects adding Numba as a dependency don't tend to have issues with increased packaging complexity - are there any special cuNumeric packaging requirements I should think about / elaborate on here?

manopapad commented 9 months ago

As @gmarkall noted, numba is widely available, so I'm not very worried about making it a hard dependency. That said, if there's pushback we can certainly make it optional (it is only necessary if the user wants to use np.vectorize or similar UDF-accepting functions).
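
For illustration, a minimal sketch of the optional-dependency pattern being suggested here (an assumption, not code from this PR): defer the import so that numba is only required when a UDF-accepting API is actually called.

def _require_numba():
    # Import numba lazily so it is only needed for UDF support.
    try:
        import numba
    except ImportError as e:
        raise ImportError(
            "numba is required for UDF-accepting functions such as "
            "vectorize; install it from conda-forge or PyPI"
        ) from e
    return numba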

manopapad commented 7 months ago

@bryevdv we've gotten a bit bogged down with mypy errors on this; would you have some bandwidth to help resolve them?

bryevdv commented 7 months ago

@manopapad These errors are because the stubs under typings/numba are not sufficiently fleshed out. For instance, this change:

diff --git a/typings/numba/core/codegen.pyi b/typings/numba/core/codegen.pyi
index a4288cce..409c94c0 100644
--- a/typings/numba/core/codegen.pyi
+++ b/typings/numba/core/codegen.pyi
@@ -1,4 +1,7 @@
 class CodeLibrary:
     codegen: "Codegen"

+    @property
+    def name(self) -> str: ...
+
 class Codegen: ...

fixes this error:

cunumeric/numba_utils.py:153:9: error: "CodeLibrary" has no attribute "name"  [attr-defined]
            f"{lib.name}_function_",
            ^~~~~~~~~~~

Either the stubs need to be expanded to cover the usage in this file, or else this file needs to be added to the override exclusions in pyproject.toml.

manopapad commented 7 months ago

@gmarkall Maybe for now it would be best to add the numba modules that are causing issues to the [[tool.mypy.overrides]] > module section of pyproject.toml, and you can later decide if you want to continue the effort of providing type hints for numba modules?
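
For context, a per-module override stanza in pyproject.toml typically looks like the sketch below; the exact set of modules to exclude is an assumption here.

[[tool.mypy.overrides]]
module = ["numba", "numba.*"]
ignore_missing_imports = true
ignore_errors = true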

gmarkall commented 5 months ago

In the end I couldn't figure out how to make the overrides work so I ended up fixing up the typing. I've merged branch 24.03 and got CI green, so I think this is now ready for consideration / feedback.

bryevdv commented 5 months ago

@manopapad This is fairly self-contained, so it should be straightforward to port to internal without much conflict, but at this point, should it just land in internal to begin with?