jcuda / jcuda-main

Summarizes the main JCuda libraries
MIT License

CUDA 11.1 #37

Closed - blueberry closed this 3 years ago

blueberry commented 3 years ago

JCuda 11.0 doesn't work with CUDA 11.1 out of the box on Linux. I hope it's a tiny issue that only needs linking against version 11.1 of NVIDIA's libraries instead of 11.0.

@jcuda I'm eager to build it as soon as you have time to update JCuda where appropriate. (In the meantime, the workaround is to ignore updates and pin CUDA to 11.0.x.)

jcuda commented 3 years ago

@blueberry Thanks for the heads-up.

I'll have to take a look at the release notes to see whether there are changes in the API. But since this is only a minor version update, I assume that there are no significant changes. In the best case, it's only a bugfix release, and then it would really only be about re-linking, but... on the one hand, NVIDIA does not seem to take "semantic versioning" very seriously, and on the other hand, they sometimes sprinkle in some arbitrary changes...

I should have some time for that next week ...considering we're just going into a second lockdown... :-/ Maybe I can then even take another look at https://github.com/jcuda/jcuda-main/issues/36 - it would really be good to be more version-agnostic...

jcuda commented 3 years ago

Just a short heads-up: I'm busy with the update. There have been some considerable changes. A new (potentially interesting) API for PTX compilation has been added: https://docs.nvidia.com/cuda/ptx-compiler-api/index.html . It should be straightforward to map this to Java, but I'll have to figure out some details.

Unfortunately, there have also been some extensions to the driver API regarding the "graph" computation, and this is always a hassle, with a distressingly bad ratio between "porting effort" and "benefit" (because the possibility to map this sensibly to Java is very limited).

I also have the impression that the NVIDIA engineers had some meeting and did a bit of brainstorming around the question: "How can we make an API that is clumsy, complicated and essentially unusable, and that cannot possibly be mapped to any other programming language than C?". And one of them came up with an answer: "Unions!" Another one said: "Structs!". And then a diligent one said: "What about unions of structs?!". And another one threw in: "Yeah, and it should be asynchronous!". And eventually, they came up with https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA_1g5dc41a62a9feb68f2e943b438c83e5ab . *sigh* - maybe I'll just omit that....

However, just for fun, a small outtake from the improvements that they made in the documentation for CUDA 11.1:

[Screenshot: ImproveDoc]

blueberry commented 3 years ago

:)

Is it possible to just ignore these things for now, and update the sane parts of the API that are already in use?

jcuda commented 3 years ago

It would probably be possible to create something like a JCuda 11.0.0-for-CUDA-11.1 or JCuda 11.1.0-alpha as a preliminary release that is compiled for CUDA 11.1, but does not yet have the functionality of CUDA 11.1. But admittedly: I don't want to go through the process of creating two releases, and try to explain that "JCuda X.Y.Z is intended for CUDA X.Y, except for this case".

I'd rather try to do the update "properly". If something like this odd "union-of-structures" causes problems or cannot sensibly be ported, I'd omit it with an UnsupportedOperationException (and maybe try to implement it for CUDA 11.2 then). I'm also not happy with these delays, but it's difficult to keep my head above water, and I'll try to finish it ASAP.
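A minimal sketch of what such a placeholder could look like (the function name here is purely hypothetical, only standing in for one of the additions that cannot sensibly be ported):

// Purely hypothetical placeholder, only illustrating the approach:
// a driver API function that cannot sensibly be mapped to Java would
// simply throw, instead of silently doing nothing.
public static int cuSomeUnportableGraphFunction(Object... args)
{
    throw new UnsupportedOperationException(
        "This function is not supported in this JCuda version");
}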

blueberry commented 3 years ago

I don't mind the delays. I'm glad that you take so much care with these details, so the wait is well worth it!

jcuda commented 3 years ago

Still only an intermediate update: I have done most of the changes for CUDA 11.1 itself - there's some tedious work still ahead, but not so much.

The PTX compilation API poses some challenges right now. Either I did not understand something there conceptually, or they changed how the concept of "handles" is supposed to be implemented. Maybe I'm just blind, but when comparing it to any other runtime library - CUBLAS, for example - there are functions for creating/destroying the handles:

The ...Create functions receive a pointer, because that's the place where they actually write the data - namely, the handle (which is itself an opaque pointer). The ...Destroy function should just destroy the handle. Now... if anybody wants to try and explain why nvPTXCompilerDestroy does not receive the handle, but a pointer to a handle, I'd be very curious to hear that. The problem is: Apparently, this pointer has to be exactly the same as the one that was passed to nvPTXCompilerCreate - which, as far as I can tell, does not make any sense, and renders the concept of "handles" useless (and hard to port to Java, for that matter). I even considered filing this as a bug, but will examine it further.
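For comparison, the usual handle pattern on the JCuda side (here with CUBLAS via JCublas2) is a plain create/use/destroy sequence. A minimal, self-contained sketch (the class itself is just for illustration):

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;

import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;

public class CublasHandlePatternSketch
{
    public static void main(String[] args)
    {
        JCublas2.setExceptionsEnabled(true);

        // The ...Create function receives the (opaque) handle object and
        // fills it, all subsequent calls receive that same handle, and
        // ...Destroy simply destroys it - no pointer-to-handle required.
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);
        // ... use the handle with the CUBLAS functions ...
        cublasDestroy(handle);
    }
}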

jcuda commented 3 years ago

(Maybe I just need(ed) a break. The handle wasn't such a great problem after all, but it confused me a bit (before I found the (actually simple) solution))

blueberry commented 3 years ago

Fantastic! Eager to try it when you think it's ready.

jcuda commented 3 years ago

Finally, the update is done (hopefully). The release candidate to build the natives is tagged as version-11.1.0-RC00, and corresponds to the current state in master.

The biggest change in terms of the build process is due to the newly introduced nvPTXCompiler API. This part is offered, following the usual pattern, as a class JNvPTXCompiler in JCuda. This involves a new native library, similar to JCudaDriver, JCudaRuntime and JNvrtc. But this should be handled "transparently": the CMake file has been extended so that it should find the library. (I'm not a CMake expert, and the FindCUDA.cmake file is huge and obnoxiously complex, so this was done with some copy+paste - if there are any problems, please let me know.)

The functionality of this new API looks interesting, although I have to admit that right now there appears to be some overlap with the existing JNvrtc and JIT functionality. (The latter always caused some problems when trying to map it to Java - so maybe JNvPTXCompiler could be a sensible replacement here, but I'll have to take a closer look at all that...)

Here is an example of the new API that could (after a cleanup) go into the jcuda-samples repo:

package jcuda.nvptxcompiler.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerCompile;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerCreate;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerDestroy;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerGetCompiledProgram;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerGetCompiledProgramSize;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerGetInfoLog;
import static jcuda.nvptxcompiler.JNvPTXCompiler.nvPTXCompilerGetInfoLogSize;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.io.IOException;
import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvptxcompiler.JNvPTXCompiler;
import jcuda.nvptxcompiler.nvPTXCompilerHandle;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

public class JNvPTXCompilerWithJnvrtcExample
{
    /**
     * The source code of the program that will be compiled at runtime:
     * A simple vector addition kernel. 
     * 
     * Note: The function should be declared as  
     * extern "C"
     * to make sure that it can be found under the given name.
     */
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void add(int n, float *a, float *b, float *sum)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    if (i<n)" + "\n" +
        "    {" + "\n" +
        "        sum[i] = a[i] + b[i];" + "\n" +
        "    }" + "\n" +
        "}" + "\n";

    /**
     * Entry point of this sample
     * 
     * @param args Not used
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);
        JNvPTXCompiler.setExceptionsEnabled(true);

        defaultInitialization();
        byte ptxCode[] = createPtxCodeWithJNvrtc(programSourceCode);
        byte elf[] = compilePtxCodeWithJNvPTXCompiler(ptxCode);
        CUfunction function = createFunction(elf, "add");
        executeFunction(function);
    }

    private static byte[] createPtxCodeWithJNvrtc(String sourceCode)
    {
        // Use the NVRTC to create a program by compiling the source code
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(program, sourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);

        // Print the compilation log (for the case there are any warnings)
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        

        // Obtain the PTX ("CUDA Assembler") code of the compiled program
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);

        // Clean up
        nvrtcDestroyProgram(program);

        // TODO: Appending the 0-terminating byte. Quick+dirty, improve this!
        byte[] ptxCode = ptx[0].getBytes();
        ptxCode = Arrays.copyOf(ptxCode, ptxCode.length + 1);

        return ptxCode;
    }

    private static byte[] compilePtxCodeWithJNvPTXCompiler(byte ptxCode[])
    {
        // Create a compiler handle
        nvPTXCompilerHandle compiler = new nvPTXCompilerHandle();
        nvPTXCompilerCreate(compiler, ptxCode.length, ptxCode);

        // Do the compilation
        String compileOptions[] = { 
            "--gpu-name=sm_70",
            "--verbose"
        };
        nvPTXCompilerCompile(compiler, compileOptions.length, compileOptions);

        // Print some log information
        long infoSize[] = { -1 };
        nvPTXCompilerGetInfoLogSize(compiler, infoSize);
        if (infoSize[0] != 0) 
        {
            byte infoLog[] = new byte[(int)infoSize[0]];
            nvPTXCompilerGetInfoLog(compiler, infoLog);
            System.out.println("Info log: " + new String(infoLog));
        }

        // Obtain the compiled binary (ELF)
        long elfSize[] = { -1 };
        nvPTXCompilerGetCompiledProgramSize(compiler, elfSize);
        byte elf[] = new byte[(int)elfSize[0]];
        nvPTXCompilerGetCompiledProgram(compiler, elf);

        // Clean up
        nvPTXCompilerDestroy(compiler);

        return elf;
    }

    private static void defaultInitialization()
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
    }

    private static CUfunction createFunction(byte elf[], String functionName)
    {
        // Create a CUDA module from the compiled data
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, elf);

        // Obtain the function pointer to the function from the module
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, functionName);

        return function;
    }

    private static void executeFunction(CUfunction function)
    {
        // Allocate and fill the host input data
        int numElements = 256 * 100;
        float hostInputA[] = new float[numElements];
        float hostInputB[] = new float[numElements];
        for(int i = 0; i < numElements; i++)
        {
            hostInputA[i] = (float)i;
            hostInputB[i] = (float)i;
        }

        // Allocate the device input data, and copy the
        // host input data to the device
        CUdeviceptr deviceInputA = new CUdeviceptr();
        cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),
            numElements * Sizeof.FLOAT);
        CUdeviceptr deviceInputB = new CUdeviceptr();
        cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),
            numElements * Sizeof.FLOAT);

        // Allocate device output memory
        CUdeviceptr deviceOutput = new CUdeviceptr();
        cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);

        // Set up the kernel parameters: A pointer to an array
        // of pointers which point to the actual values.
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{numElements}),
            Pointer.to(deviceInputA),
            Pointer.to(deviceInputB),
            Pointer.to(deviceOutput)
        );

        // Call the kernel function, which was obtained from the
        // module that was compiled at runtime
        int blockSizeX = 256;
        int gridSizeX = (numElements + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,      // Grid dimension
            blockSizeX, 1, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        cuCtxSynchronize();

        // Allocate host output memory and copy the device output
        // to the host.
        float hostOutput[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,
            numElements * Sizeof.FLOAT);

        // Verify the result
        boolean passed = true;
        for(int i = 0; i < numElements; i++)
        {
            float expected = i+i;
            if (Math.abs(hostOutput[i] - expected) > 1e-5)
            {
                System.out.println(
                    "At index "+i+ " found "+hostOutput[i]+
                    " but expected "+expected);
                passed = false;
                break;
            }
        }
        System.out.println("Test "+(passed?"PASSED":"FAILED"));

        // Clean up.
        cuMemFree(deviceInputA);
        cuMemFree(deviceInputB);
        cuMemFree(deviceOutput);
    }

}

In fact, the few lines of code that are currently in the main method are something that I'd also like to offer via a utility class - maybe it's time to revive https://github.com/jcuda/jcuda-utils for that. Together with the (once so useful) KernelLauncher class, this could, in the best case, allow writing at least basic CUDA functionality as simply as

Magic magic = compile(vectorAddCode, "add");

int n = 100;
CUdeviceptr vectorA = copyToDevice(createVector(n));
CUdeviceptr vectorB = copyToDevice(createVector(n));
CUdeviceptr vectorC = allocateFloats(n);
magic.call(vectorA, vectorB, vectorC, n);
float result[] = copyToHost(vectorC);

(For a basic vector addition, this wouldn't be necessary, because it should be covered by https://github.com/jcuda/jcuda-vec , but for "(slightly) more complex" kernels, such a utility infrastructure with Just-In-Time compilation could be handy...)

Maybe I'll do that.... when I have time :-/


Edit: As mentioned earlier, there is something that I skipped during this update: The structure https://github.com/jcuda/jcuda/blob/master/JCudaJava/src/main/java/jcuda/driver/CUarrayMapInfo.java is too complex for me to sensibly map to Java within a reasonable time frame. If somebody wants to use it, they can create their own CUDA JNI bindings.

blueberry commented 3 years ago

Thanks! Did you also include a dummy macOS loader, or should it wait for better times?

jcuda commented 3 years ago

Unfortunately, this is still on the TODO list. But I'll definitely try to tackle this before the next release! (Unless CUDA 11.2 is published tomorrow - you never know). In the best case, it is only a small change, and when I'm touching/reviewing this part anyhow, it might be a chance to see whether there is also a "simple" solution for https://github.com/jcuda/jcuda-main/issues/36

blueberry commented 3 years ago

Hi Marco,

Here's the Linux build. It went without any hiccups, and all the tests from my higher-level libraries pass. Built with the latest CUDA from the Arch Linux repos, which is 11.1.1. jcuda-11.1-linux.zip

Thank you, and sorry for being a bit late with this.

jcuda commented 3 years ago

(I'm the one causing the delays here :-/ ) Thanks, I'll do the release tomorrow (or no later than Thursday).

jcuda commented 3 years ago

Built with the latest CUDA from the Arch Linux repos, which is 11.1.1.

Thanks for that subtle hint. I had called it 11.1.0 initially, but now had another look: the version that I used was actually cuda_11.1.1_456.81_win10.exe, referred to as "CUDA Toolkit 11.1 Update 1" on the website. I have now updated the version number to match the CUDA version, so the release is version 11.1.1. (I have renamed the native libraries that you provided - carefully, and I hope that I didn't make a stupid typo or other mistake somewhere...)

This essentially means that there hasn't been a "JCuda 11.1.0" version (i.e. no release was done before their "Update 1"). But I hope that this is not much of an issue.

If there are JCuda-specific, internal updates, they will be 11.1.1b or so - we already did that back when we used the 0.major.minor version numbering. One example might be the library loading issues: if there is an update for handling the missing macOS support, or for using the JavaCPP binary releases, then this update would be 11.1.1b, and use the same native libraries internally.

The Sonatype Nexus Maven Staging upload was/is a bit slow today, but JCuda 11.1.1 is on its way into Maven Central, and will be available under the usual coordinates:

<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcuda</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcublas</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcufft</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusparse</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusolver</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcurand</artifactId>
    <version>11.1.1</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcudnn</artifactId>
    <version>11.1.1</version>
</dependency>

I'll leave this issue open until things are settled (e.g. README.md updates, maybe adding samples for the nvPTXCompiler, etc.).

And as usual, a huge thank you to @blueberry 🥇 - without you, there probably wouldn't have been any releases after JCuda 0.8.0 or so...