Official releases - Githubissues

blueberry commented 8 years ago

Since CLBlast 0.7.0 is out, maybe we can prepare the release 0.7.0 of JOCLBlast (and also RC01 of JOCL)? We have people that can build for all 3 major operating systems...

blueberry commented 8 years ago

@gpu Believe it or not, I have been doing some Windows build work for the ATLAS-based CPU engine of neanderthal, and this was the last piece of the puzzle for the neanderthal release. What a coincidence :) Thank you, Marco and Amaury!

gpu commented 8 years ago

@blueberry This time, I was quicker :-) but only with noticing that there is a https://github.com/CNugteren/CLBlast/tree/0.10.0 . Unfortunately, it may still take a week or two until I can tackle the update, but considering that there seem to be no significant changes in the API, and the fixed bugs don't seem "critical" (from quickly skimming over the change log), I hope that this is OK.

CNugteren commented 8 years ago

There are no changes in the C++ API, which you are using, right? There is now also a Netlib-compatible API, but that's not recommended for performance. And there are some changes in the C API to make name clashes less likely:

- Changed the enums in the C API to avoid potential name clashes with external code
- Added a Netlib CBLAS compatible API (not recommended for full control over performance)

gpu commented 8 years ago

I'm using the C API. And I noticed the part about renamed enums. Strictly speaking, this is a compatibility-breaking change in the API, but I'll think about how to deal with this when I start the update:

On the one hand, it will not be necessary to let this change become visible in the JOCLBlast layer (org.jocl.blast.Layout is a unique name :-)). But usually, I try to follow the naming of the underlying library as closely as possible, so I might as well rename the corresponding classes.

@blueberry Any thoughts or preferences regarding this point?

CNugteren commented 8 years ago

OK, I understand. In that case yes, the C API has changed such that it is not compatible with previous versions. I thought this was quite an important change: better sooner than later. And it's still a pre-1.0 version :slightly_smiling_face:

blueberry commented 8 years ago

@gpu I prefer the technically "better" solution over supporting legacy code, at least in the pre-1.0 versions. So, whatever changes you need or prefer to make, please make them, and I'll update neanderthal accordingly. Please just make a list of the changes, so I can be sure I updated all relevant parts.

blueberry commented 8 years ago

@gpu @CNugteren What is the relevance of this Netlib addition? It seems to me that it is irrelevant for JOCLBlast, and more like CLBlast's support for legacy code?

CNugteren commented 8 years ago

The Netlib API is really meant as a drop-in replacement and is not the main focus of the CLBlast project. It can actually yield very poor performance because of extra data copies (especially for level 1 and level 2 routines), but can sometimes give a 'free' performance improvement over CPU code for level 3 routines. It basically calls the regular CLBlast API but does a device initialization and host-device copies before and after.
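
A rough way to see why the extra host-device copies hurt level 1 routines much more than level 3 routines is to compare the arithmetic done per byte copied. The following is only a back-of-the-envelope model, not a measurement. It assumes single-precision data, that every operand is copied to the device once and the result is copied back once, and it uses the textbook operation counts for AXPY (2n) and GEMM (2mnk). The class and method names are made up for this illustration and are not part of CLBlast or JOCLBlast.

```java
/** Back-of-the-envelope model: floating-point operations per byte copied. */
final class CopyOverheadSketch {

    static final int BYTES_PER_FLOAT = 4;

    /** AXPY (level 1): y = alpha * x + y, with vectors of length n. */
    static double axpyFlopsPerByte(long n) {
        double flops = 2.0 * n;                    // one multiply and one add per element
        double bytes = 3.0 * n * BYTES_PER_FLOAT;  // copy x and y in, copy y back
        return flops / bytes;
    }

    /** GEMM (level 3): C = alpha * A * B + beta * C, with A: m x k, B: k x n, C: m x n. */
    static double gemmFlopsPerByte(long m, long n, long k) {
        double flops = 2.0 * m * n * k;            // one multiply and one add per inner-product term
        double elements = (double) m * k           // copy A in
                        + (double) k * n           // copy B in
                        + 2.0 * m * n;             // copy C in and back
        return flops / (elements * BYTES_PER_FLOAT);
    }

    public static void main(String[] args) {
        System.out.printf("AXPY, n = 1000000:     %.3f flops per byte%n",
            axpyFlopsPerByte(1_000_000));
        System.out.printf("GEMM, m = n = k = 1024: %.3f flops per byte%n",
            gemmFlopsPerByte(1024, 1024, 1024));
    }
}
```

Under these assumptions the AXPY ratio stays at 1/6 flop per byte no matter how long the vectors are, while the GEMM ratio grows with the matrix size (128 flops per byte for 1024 x 1024 matrices), which is consistent with the copies being amortized only for level 3 routines.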

The changes to the C-API are extra error codes and a 'CLBlast' prefix to all enums and constants (click on 'load diff') including for the new status codes.

gpu commented 7 years ago

@blueberry and @amherag : The tag for building the native libraries of version 0.10.0 has been added:

https://github.com/gpu/JOCLBlast/releases/tag/0.10.0-RC00

The changes are basically just following the CLBlast changes:

- The Diagonal, Layout, Side, StatusCode, Transpose and Triangle classes have been renamed to CLBlastDiagonal, CLBlastLayout, CLBlastSide, CLBlastStatusCode, CLBlastTranspose and CLBlastTriangle, respectively.
- The constants in these classes (which are basically enums) have been renamed accordingly, from names like kNonUnit to CLBlastDiagonalNonUnit.
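
For anyone updating calling code, the renaming pattern can be written down as a small helper. This is only a sketch inferred from the names that appear in this thread (kNonUnit becoming CLBlastDiagonalNonUnit, and status codes such as CLBlastSuccess, which carry no class stem). The old status code name kSuccess is an assumption, the helper is hypothetical and not part of JOCLBlast, and the actual classes remain the authority on the real names.

```java
/** Hypothetical helper that mirrors the renaming pattern described above. */
final class RenameSketch {

    /** Maps an old class name such as "Diagonal" to "CLBlastDiagonal". */
    static String newClassName(String oldClassName) {
        return "CLBlast" + oldClassName;
    }

    /**
     * Maps an old constant such as ("Diagonal", "kNonUnit") to
     * "CLBlastDiagonalNonUnit". Status codes are treated as a special case,
     * because names like CLBlastSuccess do not repeat the class name.
     */
    static String newConstantName(String oldClassName, String oldConstantName) {
        // Drop the leading 'k' of the old constant name, if present
        String stem = oldConstantName.startsWith("k")
            ? oldConstantName.substring(1)
            : oldConstantName;
        if (oldClassName.equals("StatusCode")) {
            return "CLBlast" + stem;
        }
        return "CLBlast" + oldClassName + stem;
    }

    public static void main(String[] args) {
        System.out.println(newClassName("Layout"));
        System.out.println(newConstantName("Diagonal", "kNonUnit"));
        System.out.println(newConstantName("StatusCode", "kSuccess"));
    }
}
```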

@CNugteren I was a bit confused when I saw that the Precision enum was removed from the C interface header, but I assume that this was intentional, because it was not used in the C interface.

BTW: Recently, I considered creating a small utility library for handling the "half" data type in Java. CLBlast already has dedicated methods for this data type, and it is used in other libraries as well (most prominently in cuDNN, so I could use it for https://github.com/jcuda/jcudnn ...). I'm not sure about the performance implications, though: In Java, one could only have a float[] array, and write it into a ShortBuffer where each short contains the 16 bits of the half value that corresponds to the float. This conversion is not for free. But maybe it would be compensated by the higher performance that half may achieve internally...?
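
The float[] to ShortBuffer idea could look roughly like the following. This is a minimal sketch and not an existing JOCL or CLBlast utility: the class and method names are hypothetical, and the conversion is a hand-written one that is intended to round to nearest with ties to even. Recent JDKs (20 and later, as far as I know) also offer Float.floatToFloat16 and Float.float16ToFloat, which would be preferable where available.

```java
import java.nio.ShortBuffer;

/** Hypothetical utility: pack float values into 16-bit "half" values. */
final class HalfSketch {

    /** Converts a float to the 16 bits of the nearest half value. */
    static short floatToHalf(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = (bits >>> 16) & 0x8000;
        int abs = bits & 0x7FFFFFFF;
        if (abs >= 0x7F800000) {                  // Inf or NaN
            int nanBit = (abs > 0x7F800000) ? 0x0200 : 0;
            return (short) (sign | 0x7C00 | nanBit);
        }
        if (abs >= 0x477FF000) {                  // too large for half: becomes Inf
            return (short) (sign | 0x7C00);
        }
        if (abs >= 0x38800000) {                  // normal half: rebias exponent, round mantissa
            int rounded = abs + 0x0FFF + ((abs >>> 13) & 1);
            return (short) (sign | ((rounded - 0x38000000) >>> 13));
        }
        if (abs < 0x33000000) {                   // too small even for a subnormal half
            return (short) sign;
        }
        int exponent = abs >>> 23;                // subnormal half: shift the mantissa
        int mantissa = (abs & 0x007FFFFF) | 0x00800000;
        int shift = 126 - exponent;
        int halfMantissa = mantissa >>> shift;
        int remainder = mantissa & ((1 << shift) - 1);
        int halfway = 1 << (shift - 1);
        if (remainder > halfway || (remainder == halfway && (halfMantissa & 1) == 1)) {
            halfMantissa++;
        }
        return (short) (sign | halfMantissa);
    }

    /** Converts the 16 bits of a half value back to a float (always exact). */
    static float halfToFloat(short half) {
        int h = half & 0xFFFF;
        int sign = (h & 0x8000) << 16;
        int exponent = (h >>> 10) & 0x1F;
        int mantissa = h & 0x03FF;
        int bits;
        if (exponent == 0x1F) {                   // Inf or NaN
            bits = sign | 0x7F800000 | (mantissa << 13);
        } else if (exponent != 0) {               // normal number: rebias exponent
            bits = sign | ((exponent + 112) << 23) | (mantissa << 13);
        } else if (mantissa != 0) {               // subnormal: normalize first
            int e = 113;
            while ((mantissa & 0x0400) == 0) {
                mantissa <<= 1;
                e--;
            }
            bits = sign | (e << 23) | ((mantissa & 0x03FF) << 13);
        } else {                                  // signed zero
            bits = sign;
        }
        return Float.intBitsToFloat(bits);
    }

    /** Writes the half representation of each float into a new ShortBuffer. */
    static ShortBuffer toHalfBuffer(float[] values) {
        ShortBuffer buffer = ShortBuffer.allocate(values.length);
        for (float v : values) {
            buffer.put(floatToHalf(v));
        }
        buffer.rewind();
        return buffer;
    }

    public static void main(String[] args) {
        ShortBuffer halves = toHalfBuffer(new float[] { 1.0f, 0.5f, -2.0f });
        for (int i = 0; i < halves.capacity(); i++) {
            System.out.printf("0x%04X -> %s%n",
                halves.get(i) & 0xFFFF, halfToFloat(halves.get(i)));
        }
    }
}
```

The per-element loop in toHalfBuffer is exactly the conversion cost mentioned above: it touches every value once on the host before the data can be handed to the device.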

blueberry commented 7 years ago

(Edited slightly by gpu)

Hi @gpu, @amherag, and @CNugteren I'll build this in three weeks. I'm sorry I won't be able to do it sooner, but I hope it won't cause any delay to the users of this fantastically useful library.

I wish you all happy holidays!

CNugteren commented 7 years ago

@gpu Indeed, the Precision enum was not really used at all in the API, so I removed it. Forgot to tell you, sorry!

About half-precision: In OpenCL on the host there is a cl_half data type, but it is just a different name for a 16-bit short and there are no operations possible on it. I include a small header clblast_half.h in the CLBlast repository to do float-to-half and half-to-float conversions, but it is only used in the tests and samples and so on, not in the library itself. It is up to the user to do the conversion and up to the user to decide whether or not the conversion cost is worth the faster computation. But in deep learning, for example, values can stay 16 bits for a long time or might never have to be converted to 32 bits at all: all arithmetic happens on the GPU.

Happy new year!

gpu commented 7 years ago

AFAIK, the half data type can be enabled via an extension in OpenCL kernels. But I have to admit that I do not (yet) know much about its role. I think that nearly all GPUs will internally do the computations with float anyhow. So the main purpose of half is to save space for "large" arrays (e.g. matrices) for which the precision does not need to be so high, but it probably has no positive effect on performance.

The deep learning applications seem to be one field where the precision of half is sufficient, and the memory savings are important. Recently, NVIDIA has added dedicated support for half to their GPUs, mainly for deep learning (although I don't know much about the details here). In any case, convenient support for half on the Java side would be nice to have, and might have an increasing number of application cases in the near future.

CNugteren commented 7 years ago

Indeed, the half data-type can be used to save space, which is also important for deep learning. But the latest GPUs now support half-precision (FP16) arithmetic at double the performance of single precision (FP32). This yields a direct 2x performance improvement for deep learning. Examples are NVIDIA's Pascal P100 and AMD's announced MI25 Vega GPU. But also several embedded GPUs can already do half-precision at twice the speed. Examples include ARM's Mali GPUs and Intel's GPU which you can find on-die with a CPU. See the CLBlast repository for an example benchmark.

blueberry commented 7 years ago

@gpu @amherag linux build 0.10.0 is here: joclblast-0.10.0-linux.zip

EDIT: I noticed that I haven't included libclblast.so so I updated the archive. I also tuned it for nvidia-gtx-1080 (this will be included by default in the next release of clblast)

blueberry commented 7 years ago

@gpu updated the build

amherag commented 7 years ago

@gpu @blueberry mac build 0.10.0: jocl-blast-0.10.0-SNAPSHOT.jar.zip

Sorry for the delay.

gpu commented 7 years ago

Version 0.10.0 has been released, and will be available in Maven Central soon.

Thanks @blueberry and @amherag for your contributions!

blueberry commented 7 years ago

@gpu @amherag Just to notify you that CLBlast 0.11.0 has been released.

amherag commented 7 years ago

Do I use JOCLBlast RC 0.10.0 for this build?

blueberry commented 7 years ago

I guess we should wait for @gpu to prepare new JOCLBlast, even if the change is only a version number upgrade.

CNugteren commented 7 years ago

There are also some additions to the API: 2 new batched routines, one override-parameters function, and a couple new error codes. You can see the diff to the header here.

gpu commented 7 years ago

Thanks for the pointer. I'll try to do the update (early) next week, and drop you a note here.

gpu commented 7 years ago

The update is basically done, as of https://github.com/gpu/JOCLCommon/commit/b671b0cdd79123274b60072ccc6a328773dde117 and https://github.com/gpu/JOCLBlast/commit/0103ae6033e56e7b17879b372eb3d9af24a10f88

I'd like to test the new functionalities, e.g. the batched routines and this "parameter override" thingy. I already started a JOCLBlastCaxpySample, and will try to create an example for the parameter overriding as well.

So... I'm not sure: @blueberry and @amherag You could create the libraries from the current state, although this is not (yet) tagged as a "release candidate". If there are any problems with this state, I'd have to update it, but I would like to avoid declaring the current (untested) state as a "RC"...

blueberry commented 7 years ago

No problem, I'll wait for it to be ready. Do you have an approximate time estimate for the RC?

gpu commented 7 years ago

I'll try to create the samples tomorrow (not sure how to test this "parameter override", but at least a batched example), so hopefully, the RC tag can be created tomorrow as well.

CNugteren commented 7 years ago

There is a small parameter override test in CLBlast, maybe that will help you: https://github.com/CNugteren/CLBlast/blob/master/test/correctness/misc/override_parameters.cpp

gpu commented 7 years ago

@blueberry and @amherag The RC tag for 0.11.0 is at https://github.com/gpu/JOCLBlast/releases/tag/0.11.0-RC00

@CNugteren Thanks. I have created a "simplified port" of this class for testing the OverrideParameters functionality in JOCLBlast:

package org.jocl.samples.blast;

import static org.jocl.CL.*;

import java.util.*;

import org.jocl.*;
import org.jocl.blast.*;

/**
 * An example for using the OverrideParameters functionality of CLBlast.
 *
 * This example is basically a (simplified) port of the original test at
 * https://github.com/CNugteren/CLBlast/blob/
 *     f24c142948fc71d8b37826c1275259668fe0d0e5/test/
 *     correctness/misc/override_parameters.cpp
 *     
 */
public class JOCLBlastOverrideTest
{
    // The platform, device type and device number
    // that will be used
    static final int platformIndex = 0;
    static final long deviceType = CL_DEVICE_TYPE_ALL;
    static final int deviceIndex = 0;

    private static cl_device_id device;
    private static cl_context context;
    private static cl_command_queue commandQueue;

    public static void main(String[] args)
    {
    int errors = 0;
    int passed = 0;
    int kSeed = 42; // fixed seed for reproducibility

    // Determines the test settings
    String routine_name = "SGEMM";
    String kernel_name = "Xgemm";
    int precision = CLBlastPrecision.CLBlastPrecisionSingle;
    List<Map<String, Integer>> valid_settings = createValidSettings();
    List<Map<String, Integer>> invalid_settings = createInvalidSettings();

    // Retrieves the arguments
    int m = 256;
    int n = 256;
    int k = 256;
    int a_ld = k;
    int b_ld = n;
    int c_ld = n;
    int a_offset = 0;
    int b_offset = 0;
    int c_offset = 0;
    int layout = CLBlastLayout.CLBlastLayoutRowMajor;
    int a_transpose = CLBlastTranspose.CLBlastTransposeNo;
    int b_transpose = CLBlastTranspose.CLBlastTransposeNo;
    float alpha = 0.0f;
    float beta  = 0.0f;

    // Initialize OpenCL
    defaultInitialization();

    // Populate host matrices with some example data
    float host_a[] = new float[m * k];
    float host_b[] = new float[n * k];
    float host_c[] = new float[m * n];
    Random random = new Random(kSeed);
    populateVector(host_a, random);
    populateVector(host_b, random);
    populateVector(host_c, random);

    // Copy the matrices to the device
    cl_mem device_a = copyToDevice(host_a);
    cl_mem device_b = copyToDevice(host_b);
    cl_mem device_c = copyToDevice(host_c);

    System.out.printf(
        "* Testing OverrideParameters for '%s'\n", routine_name);

    // Loops over the valid combinations: run before and run afterwards
    for (Map<String, Integer> override_setting : valid_settings)
    {
        // Call with the default parameters
        int status_before = CLBlast.CLBlastSgemm(
        layout, a_transpose, b_transpose, m, 
        n, k, alpha, device_a, a_offset, 
        a_ld, device_b, b_offset, b_ld, beta, 
        device_c, c_offset, c_ld, commandQueue, null);
        CL.clFinish(commandQueue);

        if (status_before != CLBlastStatusCode.CLBlastSuccess)
        {
        errors++;
        continue;
        }

        // Overrides the parameters
        int num_parameters = override_setting.size();
        String parameters_names[] = 
        override_setting.keySet().toArray(new String[0]);
        long[] parameters_values = 
        extractParameterValues(override_setting.values());
        int status = CLBlast.CLBlastOverrideParameters(
        device, kernel_name, precision, num_parameters, 
        parameters_names, parameters_values);

        if (status != CLBlastStatusCode.CLBlastSuccess)
        {
        errors++;
        continue;
        }

        // Call with the overridden parameters
        int status_after = CLBlast.CLBlastSgemm(
        layout, a_transpose, b_transpose, m, 
        n, k, alpha, device_a, a_offset, 
        a_ld, device_b, b_offset, b_ld, beta, 
        device_c, c_offset, c_ld, commandQueue, null);
        CL.clFinish(commandQueue);

        if (status_after != CLBlastStatusCode.CLBlastSuccess)
        {
        errors++;
        continue;
        }

        passed++;
    }

    // Loops over the invalid combinations: run before and run afterwards
    for (Map<String, Integer> override_setting : invalid_settings)
    {
        // Call with the default parameters
        int status_before = CLBlast.CLBlastSgemm(
        layout, a_transpose, b_transpose, m, 
        n, k, alpha, device_a, a_offset, 
        a_ld, device_b, b_offset, b_ld, beta, 
        device_c, c_offset, c_ld, commandQueue, null);
        CL.clFinish(commandQueue);

        if (status_before != CLBlastStatusCode.CLBlastSuccess)
        {
        errors++;
        continue;
        }

        // Overrides the parameters
        int num_parameters = override_setting.size();
        String parameters_names[] = 
        override_setting.keySet().toArray(new String[0]);
        long[] parameters_values = 
        extractParameterValues(override_setting.values());
        int status = CLBlast.CLBlastOverrideParameters(
        device, kernel_name, precision, num_parameters, 
        parameters_names, parameters_values);

        if (status == CLBlastStatusCode.CLBlastSuccess) // expecting error
        {
        errors++;
        continue;
        }

        // Call again (using the default parameters)
        int status_after = CLBlast.CLBlastSgemm(
        layout, a_transpose, b_transpose, m, 
        n, k, alpha, device_a, a_offset, 
        a_ld, device_b, b_offset, b_ld, beta, 
        device_c, c_offset, c_ld, commandQueue, null);
        CL.clFinish(commandQueue);

        if (status_after != CLBlastStatusCode.CLBlastSuccess)
        {
        errors++;
        continue;
        }

        passed++;
    }

    // Print the statistics
    System.out.printf("    %d test(s) passed\n", passed);
    System.out.printf("    %d test(s) failed\n", errors);
    System.out.printf("\n");
    }

    private static List<Map<String, Integer>> createValidSettings()
    {
    List<Map<String, Integer>> validSettings = 
        new ArrayList<Map<String, Integer>>();

    Map<String, Integer> map = null;

    map = new LinkedHashMap<String, Integer>();
    map.put("KWG",16);
    map.put("KWI",2);
    map.put("MDIMA",4);
    map.put("MDIMC",4);
    map.put("MWG",16);
    map.put("NDIMB",4);
    map.put("NDIMC",4);
    map.put("NWG",16);
    map.put("SA",0);
    map.put("SB",0);
    map.put("STRM",0);
    map.put("STRN",0);
    map.put("VWM",1);
    map.put("VWN",1);
    validSettings.add(map);

    map = new LinkedHashMap<String, Integer>();
    map.put("KWG",32);
    map.put("KWI",2);
    map.put("MDIMA",4);
    map.put("MDIMC",4);
    map.put("MWG",32);
    map.put("NDIMB",4);
    map.put("NDIMC",4);
    map.put("NWG",32);
    map.put("SA",0);
    map.put("SB",0);
    map.put("STRM",0);
    map.put("STRN",0);
    map.put("VWM",1);
    map.put("VWN",1);
    validSettings.add(map);

    return validSettings;
    }

    private static List<Map<String, Integer>> createInvalidSettings()
    {
    List<Map<String, Integer>> invalidSettings = 
        new ArrayList<Map<String, Integer>>();

    Map<String, Integer> map = null;

    map = new LinkedHashMap<String, Integer>();
    map.put("KWI",2);
    map.put("MDIMA",4);
    map.put("MDIMC",4);
    map.put("MWG",16);
    map.put("NDIMB",4);
    map.put("NDIMC",4);
    map.put("NWG",16);
    map.put("SA",0);
    invalidSettings.add(map);

    return invalidSettings;
    }

    private static long[] extractParameterValues(Collection<Integer> integers)
    {
    long result[] = new long[integers.size()];
    int index = 0;
    for (Integer integer : integers)
    {
        result[index] = integer;
        index++;
    }
    return result;
    }

    private static void populateVector(float a[], Random random)
    {
    for (int i=0; i<a.length; i++)
    {
        a[i] = random.nextFloat();
    }
    }

    private static cl_mem copyToDevice(float host[])
    {
    cl_mem device = clCreateBuffer(context, CL_MEM_READ_WRITE,
        host.length * Sizeof.cl_float, null, null);
    clEnqueueWriteBuffer(commandQueue, device, CL_TRUE, 0,
        host.length * Sizeof.cl_float, 
        Pointer.to(host), 0, null, null);
    return device;
    }

    /**
     * Default OpenCL initialization of the device, context and command queue
     */
    private static void defaultInitialization()
    {
    // Enable exceptions and subsequently omit error checks in this sample
    CL.setExceptionsEnabled(true);

    // Obtain the number of platforms
    int numPlatformsArray[] = new int[1];
    clGetPlatformIDs(0, null, numPlatformsArray);
    int numPlatforms = numPlatformsArray[0];

    // Obtain a platform ID
    cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
    clGetPlatformIDs(platforms.length, platforms, null);
    cl_platform_id platform = platforms[platformIndex];

    // Initialize the context properties
    cl_context_properties contextProperties = new cl_context_properties();
    contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

    // Obtain the number of devices for the platform
    int numDevicesArray[] = new int[1];
    clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
    int numDevices = numDevicesArray[0];

    // Obtain a device ID
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
    device = devices[deviceIndex];

    // Create a context for the selected device
    context = clCreateContext(
        contextProperties, 1, new cl_device_id[]{device},
        null, null, null);

    String deviceName = getString(device, CL_DEVICE_NAME);
    System.out.printf("CL_DEVICE_NAME: %s\n", deviceName);

    // Create a command-queue
    commandQueue = clCreateCommandQueue(
        context, device, 0, null);

    }

    private static String getString(cl_device_id device, int paramName)
    {
    // Obtain the length of the string that will be queried
    long size[] = new long[1];
    clGetDeviceInfo(device, paramName, 0, null, size);

    // Create a buffer of the appropriate size and fill it with the info
    byte buffer[] = new byte[(int)size[0]];
    clGetDeviceInfo(device, paramName, buffer.length, 
        Pointer.to(buffer), null);

    // Create a string from the buffer (excluding the trailing \0 byte)
    return new String(buffer, 0, buffer.length-1);
    }

}
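
The sample above builds the names array from keySet() and the values array from values(). That works because a LinkedHashMap iterates both in insertion order, so names[i] and values[i] describe the same parameter. A variant that makes the pairing explicit, and that would also stay aligned for other Map implementations, is to fill both arrays in one pass over the entries. The class below is a hypothetical helper written for this illustration and is not part of JOCLBlast.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parallel name/value arrays for an override call. */
final class OverrideArraysSketch {

    final String[] names;
    final long[] values;

    private OverrideArraysSketch(String[] names, long[] values) {
        this.names = names;
        this.values = values;
    }

    /** Fills both arrays in a single pass, so index i always refers to one entry. */
    static OverrideArraysSketch from(Map<String, Integer> settings) {
        String[] names = new String[settings.size()];
        long[] values = new long[settings.size()];
        int index = 0;
        for (Map.Entry<String, Integer> entry : settings.entrySet()) {
            names[index] = entry.getKey();
            values[index] = entry.getValue();
            index++;
        }
        return new OverrideArraysSketch(names, values);
    }

    public static void main(String[] args) {
        // A few of the tuning parameters from the sample, in insertion order
        Map<String, Integer> settings = new LinkedHashMap<>();
        settings.put("KWG", 16);
        settings.put("KWI", 2);
        settings.put("MDIMA", 4);

        OverrideArraysSketch arrays = from(settings);
        for (int i = 0; i < arrays.names.length; i++) {
            System.out.println(arrays.names[i] + " = " + arrays.values[i]);
        }
    }
}
```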

Also, a small test/example for the CLBlastCaxpyBatched function:

package org.jocl.samples.blast;

import static org.jocl.CL.*;
import static org.jocl.blast.CLBlast.CLBlastCaxpyBatched;

import java.nio.FloatBuffer;
import java.util.Locale;

import org.jocl.*;
import org.jocl.blast.CLBlast;

/**
 * An example for using the batched CAXPY function from CLBlast to compute
 * Y = a * X + Y
 * for several single-precision complex number vectors
 */
public class JOCLBlastCaxpyBatchedSample
{
    private static cl_context context;
    private static cl_command_queue commandQueue;

    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[])
    {
    CL.setExceptionsEnabled(true);
    CLBlast.setExceptionsEnabled(true);

    defaultInitialization();

    // Create the host input data. Each entry of these vectors consists 
    // of TWO values, which are the real- and imaginary part of the 
    // complex number
    int numVectors = 3;
    int vectorSize = 5;

    // 3 vectors, each with 5 dimensions (*2, for real- and imaginary part)
    float X[] =  
    {
        1,1, 1,2, 1,3, 1,4, 1,5,
        2,1, 2,2, 2,3, 2,4, 2,5,
        3,1, 3,2, 3,3, 3,4, 3,5,
    };
    // 3 vectors, each with 5 dimensions (*2, for real- and imaginary part)
    float Y[] =
    {
        4,1, 4,2, 4,3, 4,4, 4,5,
        5,1, 5,2, 5,3, 5,4, 5,5,
        6,1, 6,2, 6,3, 6,4, 6,5,
    };

    // Create the device input buffers
    cl_mem memX = clCreateBuffer(context, CL_MEM_READ_ONLY,
        vectorSize * numVectors * Sizeof.cl_float2, null, null);
    cl_mem memY = clCreateBuffer(context, CL_MEM_READ_WRITE,
        vectorSize * numVectors * Sizeof.cl_float2, null, null);

    // Copy the host data to the device
    clEnqueueWriteBuffer(commandQueue, memX, CL_TRUE, 0,
        vectorSize * numVectors * Sizeof.cl_float2, 
        Pointer.to(X), 0, null, null);
    clEnqueueWriteBuffer(commandQueue, memY, CL_TRUE, 0,
        vectorSize * numVectors * Sizeof.cl_float2, 
        Pointer.to(Y), 0, null, null);

    // 3 factors to be multiplied with X (*2, for real- and imaginary part)
    float alphas[] = { 1,2, 2,3, 3,4 };

    // Execute batched CAXPY: Y = alpha * X + Y
    cl_event event = new cl_event();
    CLBlastCaxpyBatched(vectorSize, alphas, 
        memX, new long[] { 0, 5, 10 }, 1, 
        memY, new long[] { 0, 5, 10 }, 1,  
        numVectors, commandQueue, event);

    // Wait for the computation to be finished
    clWaitForEvents( 1, new cl_event[] { event });

    // Copy the result data back to the host
    float resultY[] = new float[vectorSize * numVectors * 2];
    clEnqueueReadBuffer(commandQueue, memY, CL_TRUE, 0,
        vectorSize * numVectors * Sizeof.cl_float2, 
        Pointer.to(resultY), 0, null, null);

    // Print the inputs and the result
    System.out.println("a:");
    printComplex2D(FloatBuffer.wrap(alphas), 1);

    System.out.println("X:");
    printComplex2D(FloatBuffer.wrap(X), vectorSize);

    System.out.println("Y:");
    printComplex2D(FloatBuffer.wrap(Y), vectorSize);

    System.out.println("Result:");
    printComplex2D(FloatBuffer.wrap(resultY), vectorSize);

    // Clean up
    clReleaseMemObject(memX);
    clReleaseMemObject(memY);
    clReleaseCommandQueue(commandQueue);
    clReleaseContext(context);        
    }

    /**
     * Default OpenCL initialization of the context and command queue
     */
    private static void defaultInitialization()
    {
    // The platform, device type and device number
    // that will be used
    final int platformIndex = 0;
    final long deviceType = CL_DEVICE_TYPE_ALL;
    final int deviceIndex = 0;

    // Enable exceptions and subsequently omit error checks in this sample
    CL.setExceptionsEnabled(true);

    // Obtain the number of platforms
    int numPlatformsArray[] = new int[1];
    clGetPlatformIDs(0, null, numPlatformsArray);
    int numPlatforms = numPlatformsArray[0];

    // Obtain a platform ID
    cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
    clGetPlatformIDs(platforms.length, platforms, null);
    cl_platform_id platform = platforms[platformIndex];

    // Initialize the context properties
    cl_context_properties contextProperties = new cl_context_properties();
    contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

    // Obtain the number of devices for the platform
    int numDevicesArray[] = new int[1];
    clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
    int numDevices = numDevicesArray[0];

    // Obtain a device ID
    cl_device_id devices[] = new cl_device_id[numDevices];
    clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
    cl_device_id device = devices[deviceIndex];

    // Create a context for the selected device
    context = clCreateContext(
        contextProperties, 1, new cl_device_id[]{device},
        null, null, null);

    String deviceName = getString(device, CL_DEVICE_NAME);
    System.out.printf("CL_DEVICE_NAME: %s\n", deviceName);

    // Create a command-queue
    commandQueue = clCreateCommandQueue(
        context, device, 0, null);

    }

    /**
     * Print the given buffer as a matrix with the given number of columns.
     * This assumes that the elements of these buffers are complex 
     * numbers, consisting of a real- and an imaginary part.
     *
     * @param data The buffer
     * @param columns The number of columns
     */
    private static void printComplex2D(FloatBuffer data, int columns)
    {
    StringBuffer sb = new StringBuffer();
    for (int i=0; i<data.capacity() / 2; i++)
    {
        sb.append(String.format(Locale.ENGLISH, "(%5.1f, %5.1fi) ",
        data.get(i * 2 + 0), data.get(i * 2 + 1)));
        if (((i + 1) % columns) == 0)
        {
        sb.append("\n");
        }
    }
    System.out.print(sb.toString());
    }

    private static String getString(cl_device_id device, int paramName)
    {
    // Obtain the length of the string that will be queried
    long size[] = new long[1];
    clGetDeviceInfo(device, paramName, 0, null, size);

    // Create a buffer of the appropriate size and fill it with the info
    byte buffer[] = new byte[(int)size[0]];
    clGetDeviceInfo(device, paramName, buffer.length, 
        Pointer.to(buffer), null);

    // Create a string from the buffer (excluding the trailing \0 byte)
    return new String(buffer, 0, buffer.length-1);
    }

}

Both seem to work well. I'll have to dive deeper into what OverrideParameters actually does to be sure that it has the intended effect, but I received some error messages from the OpenCL compiler when I called it with wrong parameters, so it at least does have an effect ;-)
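
To check the output of the batched CAXPY sample without a GPU, the expected values can be computed on the host. The following plain-Java reference is a sketch under assumptions: it treats offsets and increments as counts of complex elements (which matches the offsets 0, 5, 10 used above for vectors of 5 complex numbers), it only handles positive increments, and its class and method names are made up for this illustration.

```java
/** Hypothetical CPU reference for a batched complex AXPY: y = alpha * x + y. */
final class CaxpyBatchedReference {

    /**
     * All arrays store complex numbers as interleaved (real, imaginary) floats.
     * Offsets and increments are given in complex elements, not in floats.
     */
    static void caxpyBatched(int n, float[] alphas,
        float[] x, long[] xOffsets, int xInc,
        float[] y, long[] yOffsets, int yInc, int batchCount) {
        for (int b = 0; b < batchCount; b++) {
            float alphaRe = alphas[2 * b];
            float alphaIm = alphas[2 * b + 1];
            for (int i = 0; i < n; i++) {
                int xi = 2 * ((int) xOffsets[b] + i * xInc);
                int yi = 2 * ((int) yOffsets[b] + i * yInc);
                float xRe = x[xi];
                float xIm = x[xi + 1];
                // Complex product alpha * x, added to y
                y[yi]     += alphaRe * xRe - alphaIm * xIm;
                y[yi + 1] += alphaRe * xIm + alphaIm * xRe;
            }
        }
    }

    public static void main(String[] args) {
        // First vector of the sample only: alpha = (1, 2i)
        float[] x = { 1, 1, 1, 2, 1, 3, 1, 4, 1, 5 };
        float[] y = { 4, 1, 4, 2, 4, 3, 4, 4, 4, 5 };
        float[] alphas = { 1, 2 };
        caxpyBatched(5, alphas, x, new long[] { 0 }, 1, y, new long[] { 0 }, 1, 1);
        for (int i = 0; i < 5; i++) {
            System.out.printf("(%5.1f, %5.1fi) ", y[2 * i], y[2 * i + 1]);
        }
        System.out.println();
    }
}
```

For the first element of the sample, alpha = (1, 2i), x = (1, 1i) and y = (4, 1i), so the product is (-1, 3i) and the expected result is (3, 4i).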

I still have to create a GitHub repo for all the JOCL samples, so that I can finally summarize the examples from http://jocl.org/samples/samples.html and the ones that are posted elsewhere (in the forum and here) in one place....

blueberry commented 7 years ago

@gpu @amherag Here is the linux build for 0.11.0. Everything went smoothly. jocl-blast-0.11.0-SNAPSHOT.zip

gpu commented 7 years ago

(EDIT: Writing this overlapped with the comment at https://github.com/gpu/JOCLBlast/issues/9#issuecomment-303222830 )

I have done a small update for https://github.com/gpu/JOCLBlast/issues/9#issuecomment-303222495

Although technically, it should not change anything for the linux version, it might be clearer if the linux version would also be compiled based on this state. (The change might still cause issues on Linux - although, of course, it should not, but just to be sure...)

amherag commented 7 years ago

Here it is :)

jocl-blast-0.11.0-SNAPSHOT.jar.zip

blueberry commented 7 years ago

And the linux build is also ready: jocl-blast-0.11.0-SNAPSHOT-22-5-2017.zip

gpu commented 7 years ago

You're great! I'll build the Maven package ASAP (maybe tomorrow, but most likely not later than Thursday)

gpu commented 7 years ago

Thanks again to @amherag and @blueberry (and @CNugteren , for making all this possible in the first place ;-) )

The release will soon be available as

<dependency>
    <groupId>org.jocl</groupId>
    <artifactId>jocl-blast</artifactId>
    <version>0.11.0</version>
</dependency>

blueberry commented 7 years ago

https://github.com/CNugteren/CLBlast/releases/tag/1.0.0

amherag commented 7 years ago

@blueberry @gpu Done!

jocl-blast-0.11.1-SNAPSHOT.jar.zip

blueberry commented 7 years ago

@amherag Hi Amaury. I'm afraid that we first have to wait for @gpu to update JOCLBlast to the newest CLBlast 1.0 :)

amherag commented 7 years ago

@blueberry Yeah, I was wondering why the versions didn't match. I was going to update my comment, but I decided to wait and see what you or @gpu were going to tell me :P

gpu commented 7 years ago

Thanks for the heads-up. Apart from the *AMIN functions, there seem to be no changes in the API. I'll try to schedule the update ASAP (I'm a bit short on time this week, but will see what I can do)

blueberry commented 7 years ago

Thank you, @gpu

CNugteren commented 7 years ago

Thanks again everyone! There was a bug fixed just after the release though, so I'll make a 1.0.1 release soon after (next week after everything is properly checked this time). Perhaps you should wait for that?

blueberry commented 7 years ago

@CNugteren @gpu I'd prefer to wait for the proper release, as I am in no hurry. Thanks everyone!

gpu commented 7 years ago

Yes, that sounds like a plan :-)

CNugteren commented 7 years ago

New 1.0.1 release is now made, sorry for any inconvenience. Greatly appreciate your effort with JOCLBlast!

gpu commented 7 years ago

These efforts are nothing compared to the efforts that went into CLBlast itself 👍

(I'll do the update on Sunday/Monday and drop a note here)

gpu commented 7 years ago

Although it's already Tuesday now, here is the tag for the 1.0.1 release:

https://github.com/gpu/JOCLBlast/releases/tag/1.0.1-RC00

@blueberry and @amherag Once the natives for JOCLBlast and CLBlast are available, I'll publish the Maven release.

(BTW: This issue is already rather long. I'd probably close this after the release, so that we can use dedicated issues for the subsequent releases)

blueberry commented 7 years ago

I will be able to build and test it only in a few weeks. I hope that is OK. Sorry.

gpu commented 7 years ago

OK for me. Maybe that's a chance for me to try and build this on a VirtualBox VM. This should work, but not being able to really test the resulting library would make me hesitant to publish it.

(Maybe I can build it on a VM, and you can try out whether the resulting lib works on a real machine. If it does, I could build the linux libs myself in the future)

blueberry commented 7 years ago

Just testing whether it works wouldn't be that time-consuming for me, but the thing is that Cedric committed new tuning results for the GPU that I use from another user that tuned it with a newer GPU. However, that user was getting some results that were suspicious to me, so I need to investigate this and make some measurements to see whether these new changes do not introduce some noticeable performance regressions on my hardware (R9 290X)...

blueberry commented 7 years ago

@gpu Hi Marco. I've finally gotten around to building JOCLBlast 1.0.1 for Linux. Sorry for the delay.

@amherag reminder :)

jocl-blast-1.0.1-SNAPSHOT.zip

amherag commented 7 years ago

@blueberry Thanks for the reminder :D

jocl-blast-1.0.1-SNAPSHOT.jar.zip

gpu / JOCLBlast

Official releases #8