Closed blueberry closed 7 years ago
@gpu Believe it or not, I have been doing some Windows build work for the ATLAS-based CPU engine of neanderthal, and this was the last piece of the puzzle for the neanderthal release. What a coincidence :) Thank you, Marco and Amaury!
@blueberry This time, I was quicker :-) but only with noticing that there is a https://github.com/CNugteren/CLBlast/tree/0.10.0 . Unfortunately, it may still take a week or two until I can tackle the update, but considering that there seem to be no significant changes in the API, and the fixed bugs don't seem "critical" (from quickly skimming over the change log), I hope that this is OK.
There are no changes in the C++ API, which you are using, right? There is now also a Netlib-compatible API, but that's not recommended for performance. And there are some changes in the C API to make name clashes less likely:
- Changed the enums in the C API to avoid potential name clashes with external code
- Added a Netlib CBLAS compatible API (not recommended for full control over performance)
I'm using the C API. And I noticed the part about renamed enums. Strictly speaking, this is a compatibility-breaking change in the API, but I'll think about how to deal with this when I start the update:
On the one hand, it will not be necessary to let this change become visible in the JOCLBlast layer (org.jocl.blast.Layout is a unique name :-)). But usually, I try to follow the naming of the underlying library as closely as possible, so I might as well rename the corresponding classes.
@blueberry Any thoughts or preferences regarding this point?
OK, I understand. In that case yes, the C API has changed such that it is not compatible with previous versions. I thought this was quite an important change, better sooner than later. And it's still a pre-1.0 version :slightly_smiling_face:
@gpu I prefer the technically "better" solution over supporting legacy code, at least in the pre-1.0 versions. So, whatever changes you need or prefer to make, please make them, and I'll update neanderthal accordingly. Please just make a list of the changes, so I can be sure I updated all relevant parts.
@gpu @CNugteren What is the relevance of this Netlib addition? It seems to me that it is irrelevant for JOCLBlast, and more like CLBlast's support for legacy code?
The Netlib API is really meant as a drop-in replacement and is not the main focus of the CLBlast project. It can actually yield very poor performance because of extra data copies (especially for level 1 and level 2 routines), but can sometimes give a 'free' performance improvement over CPU code for level 3 routines. It basically calls the regular CLBlast API but does a device initialization and host-device copies before and after.
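A back-of-the-envelope sketch of why the extra copies hurt level 1 and level 2 routines more than level 3 routines. It only counts floating-point operations per byte that has to travel between host and device. The class and method names are made up for this illustration, and the model assumes that every operand is copied to the device once and the result is copied back once, which is a simplification of what the Netlib layer really does.

```java
/** Rough model of arithmetic per copied byte. All names are hypothetical. */
class NetlibCopyOverhead
{
    private static final int BYTES_PER_FLOAT = 4;

    /** SAXPY (level 1): y = alpha * x + y, with vectors of length n */
    static double saxpyFlopsPerByte(long n)
    {
        double flops = 2.0 * n;                   // one multiply and one add per element
        double bytes = 3.0 * n * BYTES_PER_FLOAT; // x and y to the device, y back
        return flops / bytes;
    }

    /** SGEMV (level 2): y = alpha * A * x + beta * y, with A being n x n */
    static double sgemvFlopsPerByte(long n)
    {
        double flops = 2.0 * n * n;
        // A, x and y to the device, y back
        double bytes = ((double) n * n + 3.0 * n) * BYTES_PER_FLOAT;
        return flops / bytes;
    }

    /** SGEMM (level 3): C = alpha * A * B + beta * C, all matrices n x n */
    static double sgemmFlopsPerByte(long n)
    {
        double flops = 2.0 * n * n * n;
        double bytes = 4.0 * n * n * BYTES_PER_FLOAT; // A, B and C to the device, C back
        return flops / bytes;
    }

    public static void main(String[] args)
    {
        long n = 1024;
        System.out.println("SAXPY flops per copied byte: " + saxpyFlopsPerByte(n));
        System.out.println("SGEMV flops per copied byte: " + sgemvFlopsPerByte(n));
        System.out.println("SGEMM flops per copied byte: " + sgemmFlopsPerByte(n));
    }
}
```

For level 1 and level 2 the ratio stays below one operation per byte regardless of n, while for level 3 it grows linearly with n. That is only the counting argument behind the statement above, not a performance measurement.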
The changes to the C-API are extra error codes and a 'CLBlast' prefix to all enums and constants (click on 'load diff') including for the new status codes.
@blueberry and @amherag : The tag for building the native libraries of version 0.10.0 has been added:
https://github.com/gpu/JOCLBlast/releases/tag/0.10.0-RC00
The changes are basically just following the CLBlast changes:
- The Diagonal, Layout, Side, StatusCode, Transpose and Triangle classes have been renamed to CLBlastDiagonal, CLBlastLayout, CLBlastSide, CLBlastStatusCode, CLBlastTranspose and CLBlastTriangle, respectively.
- The constants (enums) have been renamed accordingly, from names like kNonUnit to CLBlastDiagonalNonUnit
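For updating client code, the renaming pattern can be written down as a small helper. This is only a sketch: the class and method are hypothetical and not part of JOCLBlast. The kNonUnit example is taken from the list above, the special case for status codes is inferred from names such as CLBlastSuccess that appear in the sample code later in this thread, and old names like kRowMajor and kSuccess are assumptions about the previous API.

```java
/** Hypothetical migration helper, not part of JOCLBlast. */
class EnumRenameSketch
{
    /**
     * Returns the assumed new constant name.
     *
     * @param enumName The old class name, e.g. "Diagonal"
     * @param oldConstant The old constant name, e.g. "kNonUnit"
     * @return The assumed new name, e.g. "CLBlastDiagonalNonUnit"
     */
    static String renamed(String enumName, String oldConstant)
    {
        if (oldConstant.length() < 2 || !oldConstant.startsWith("k"))
        {
            throw new IllegalArgumentException(
                "Expected a name like kNonUnit, but got " + oldConstant);
        }
        String suffix = oldConstant.substring(1);

        // Status codes seem to omit the class name (CLBlastSuccess).
        // This is inferred from the samples, not from a specification.
        if (enumName.equals("StatusCode"))
        {
            return "CLBlast" + suffix;
        }
        return "CLBlast" + enumName + suffix;
    }

    public static void main(String[] args)
    {
        System.out.println(renamed("Diagonal", "kNonUnit"));
        System.out.println(renamed("Layout", "kRowMajor"));
        System.out.println(renamed("StatusCode", "kSuccess"));
    }
}
```

The header diff remains the authoritative list of changes; the helper only captures the general pattern.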
@CNugteren I was a bit confused when I saw that the Precision enum was removed from the C interface header, but I assume that this was intentional, because it was not used in the C interface.
BTW: Recently, I considered creating a small utility library for handling the "half" data type in Java. CLBlast already has dedicated methods for this data type, and it is used in other libraries as well (most prominently in cuDNN, so I could use it for https://github.com/jcuda/jcudnn ...). I'm not sure about the performance implications, though: In Java, one could only have a float[] array, and write it into a ShortBuffer where each short contains the 16 bits of the half value that corresponds to the float. This conversion is not for free. But maybe it would be compensated by the higher performance that half may achieve internally...?
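A minimal sketch of what such a conversion could look like, assuming the usual IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). The class and method names are hypothetical. The rounding is round-to-nearest with ties going away from zero, which is simpler than the round-to-nearest-even that a careful implementation would probably use.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

/** Hypothetical helper for packing float values as 16 bit half values. */
class HalfBufferSketch
{
    /** Converts a float to the 16 bits of the nearest half value */
    static short floatToHalf(float value)
    {
        int bits = Float.floatToIntBits(value);
        int sign = (bits >>> 16) & 0x8000;
        int exponent = (bits >>> 23) & 0xFF;
        int mantissa = bits & 0x007FFFFF;

        if (exponent == 0xFF)
        {
            // Infinity, or NaN (marked with a non-zero mantissa)
            return (short) (sign | 0x7C00 | (mantissa != 0 ? 0x0200 : 0));
        }
        int halfExponent = exponent - 127 + 15;
        if (halfExponent >= 0x1F)
        {
            // Too large for half: overflow to infinity
            return (short) (sign | 0x7C00);
        }
        if (halfExponent <= 0)
        {
            if (halfExponent < -10)
            {
                // Too small even for a subnormal half: signed zero
                return (short) sign;
            }
            // Subnormal half: make the implicit leading 1 explicit and shift
            mantissa |= 0x00800000;
            int shift = 14 - halfExponent;
            int halfMantissa = mantissa >>> shift;
            if (((mantissa >>> (shift - 1)) & 1) != 0)
            {
                halfMantissa++;
            }
            return (short) (sign | halfMantissa);
        }
        int half = sign | (halfExponent << 10) | (mantissa >>> 13);
        if ((mantissa & 0x00001000) != 0)
        {
            // Round up. A carry into the exponent bits yields the right value
            half++;
        }
        return (short) half;
    }

    /** Writes the given values as half values into a new direct buffer */
    static ShortBuffer toHalfBuffer(float[] values)
    {
        ShortBuffer buffer = ByteBuffer
            .allocateDirect(values.length * Short.BYTES)
            .order(ByteOrder.nativeOrder())
            .asShortBuffer();
        for (float value : values)
        {
            buffer.put(floatToHalf(value));
        }
        buffer.rewind();
        return buffer;
    }

    public static void main(String[] args)
    {
        ShortBuffer buffer = toHalfBuffer(new float[] { 1.0f, 0.5f, -2.0f });
        for (int i = 0; i < buffer.capacity(); i++)
        {
            System.out.printf("0x%04X%n", buffer.get(i) & 0xFFFF);
        }
    }
}
```

The loop touches every element once on the host, which is exactly the conversion cost mentioned above. Newer JDKs (20 and later) also offer Float.floatToFloat16 for the per-element conversion.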
(Edited slightly by gpu)
Hi @gpu, @amherag, and @CNugteren I'll build this in three weeks. I'm sorry I won't be able to do it sooner, but I hope it won't cause any delay to the users of this fantastically useful library.
I wish you all happy holidays!
@gpu Indeed, the Precision enum was not really used at all in the API, so I removed it. Forgot to tell you, sorry!
About half-precision: In OpenCL on the host there is a cl_half data type, but it is just a different name for a 16-bit short and there are no operations possible on it. I include a small header clblast_half.h in the CLBlast repository to do float-to-half and half-to-float conversions, but it is only used in the tests and samples and so on, not in the library itself. It is up to the user to do the conversion and up to the user to decide whether or not the conversion cost is worth the faster computation. But for example in deep learning, values can stay 16-bit for a long time or might never have to be converted to 32 bits at all: all arithmetic happens on the GPU.
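For reading results back on the Java side, the opposite direction is needed. The following is a sketch of a half-to-float conversion under the assumption of the IEEE 754 binary16 layout. It is not a port of clblast_half.h, and the class and method names are hypothetical.

```java
/** Hypothetical helper for decoding 16 bit half values on the host. */
class HalfToFloatSketch
{
    /** Converts the 16 bits of a half value into the equivalent float */
    static float halfToFloat(short half)
    {
        int h = half & 0xFFFF;
        int sign = (h & 0x8000) << 16;
        int exponent = (h >>> 10) & 0x1F;
        int mantissa = h & 0x03FF;

        if (exponent == 0x1F)
        {
            // Infinity or NaN: use the maximum float exponent
            return Float.intBitsToFloat(
                sign | 0x7F800000 | (mantissa << 13));
        }
        if (exponent == 0)
        {
            // Zero or subnormal: the value is mantissa * 2^-24
            float magnitude = mantissa * (1.0f / (1 << 24));
            return sign != 0 ? -magnitude : magnitude;
        }
        // Normal number: change the exponent bias from 15 to 127
        return Float.intBitsToFloat(
            sign | ((exponent - 15 + 127) << 23) | (mantissa << 13));
    }

    public static void main(String[] args)
    {
        short[] samples = { (short) 0x3C00, (short) 0xC000, (short) 0x7BFF };
        for (short sample : samples)
        {
            System.out.println(halfToFloat(sample));
        }
    }
}
```

Every half value is exactly representable as a float, so this direction is lossless. Only the float-to-half direction has to round.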
Happy new year!
AFAIK, the half data type can be enabled via an extension in OpenCL kernels. But I have to admit that I do not (yet) know much about its role. I think that nearly all GPUs will internally do the computations with float anyhow. So the main purpose of half is to save space for "large" arrays (e.g. matrices) for which the precision does not need to be so high, but it probably has no positive effect on performance.
The deep learning applications seem to be one field where the precision of half is sufficient, and the memory savings are important. Recently, NVIDIA has added dedicated support for half to their GPUs, mainly for deep learning (although I don't know much about the details here). In any case, convenient support for half on the Java side would be nice to have, and might have an increasing number of application cases in the near future.
Indeed, the half data type can be used to save space, which is also important for deep learning. But the latest GPUs now support half-precision (FP16) arithmetic at double the performance of single precision (FP32). This yields a direct 2x performance improvement for deep learning. Examples are NVIDIA's Pascal P100 and AMD's announced MI25 Vega GPU. But also several embedded GPUs can already do half-precision at twice the speed. Examples include ARM's Mali GPUs and Intel's GPU which you can find on-die with a CPU. See the CLBlast repository for an example benchmark.
@gpu @amherag linux build 0.10.0 is here: joclblast-0.10.0-linux.zip
EDIT: I noticed that I haven't included libclblast.so so I updated the archive. I also tuned it for nvidia-gtx-1080 (this will be included by default in the next release of clblast)
@gpu updated the build
@gpu @blueberry mac build 0.10.0: jocl-blast-0.10.0-SNAPSHOT.jar.zip
Sorry for the delay.
Version 0.10.0 has been released, and will be available in Maven Central soon.
Thanks @blueberry and @amherag for your contributions!
@gpu @amherag Just to notify you that CLBlast 0.11.0 has been released.
Do I use JOCLBlast RC 0.10.0 for this build?
I guess we should wait for @gpu to prepare new JOCLBlast, even if the change is only a version number upgrade.
There are also some additions to the API: 2 new batched routines, one override-parameters function, and a couple new error codes. You can see the diff to the header here.
Thanks for the pointer. I'll try to do the update (early) next week, and drop you a note here.
The update is basically done, as of https://github.com/gpu/JOCLCommon/commit/b671b0cdd79123274b60072ccc6a328773dde117 and https://github.com/gpu/JOCLBlast/commit/0103ae6033e56e7b17879b372eb3d9af24a10f88
I'd like to test the new functionalities, e.g. the batched routines and this "parameter override" thingy. I already started a JOCLBlastCaxpySample, and will try to create an example for the parameter overriding as well.
So... I'm not sure: @blueberry and @amherag You could create the libraries from the current state, although this is not (yet) tagged as a "release candidate". If there are any problems with this state, I'd have to update it, but I would like to avoid declaring the current (untested) state as a "RC"...
No problem, I'll wait for it to be ready. Do you have an approximate time estimate for the RC?
I'll try to create the samples tomorrow (not sure how to test this "parameter override", but at least a batched example), so hopefully, the RC tag can be created tomorrow as well.
There is a small parameter override test in CLBlast, maybe that will help you: https://github.com/CNugteren/CLBlast/blob/master/test/correctness/misc/override_parameters.cpp
@blueberry and @amherag The RC tag for 0.11.0 is at https://github.com/gpu/JOCLBlast/releases/tag/0.11.0-RC00
@CNugteren Thanks. I have created a "simplified port" of this class for testing the OverrideParameters functionality in JOCLBlast:
package org.jocl.samples.blast;
import static org.jocl.CL.*;
import java.util.*;
import org.jocl.*;
import org.jocl.blast.*;
/**
* An example for using the OverrideParameters functionality of CLBlast.
*
* This example is basically a (simplified) port of the original test at
* https://github.com/CNugteren/CLBlast/blob/
* f24c142948fc71d8b37826c1275259668fe0d0e5/test/
* correctness/misc/override_parameters.cpp
*
*/
public class JOCLBlastOverrideTest
{
// The platform, device type and device number
// that will be used
static final int platformIndex = 0;
static final long deviceType = CL_DEVICE_TYPE_ALL;
static final int deviceIndex = 0;
private static cl_device_id device;
private static cl_context context;
private static cl_command_queue commandQueue;
public static void main(String[] args)
{
int errors = 0;
int passed = 0;
int kSeed = 42; // fixed seed for reproducibility
// Determines the test settings
String routine_name = "SGEMM";
String kernel_name = "Xgemm";
int precision = CLBlastPrecision.CLBlastPrecisionSingle;
List<Map<String, Integer>> valid_settings = createValidSettings();
List<Map<String, Integer>> invalid_settings = createInvalidSettings();
// Retrieves the arguments
int m = 256;
int n = 256;
int k = 256;
int a_ld = k;
int b_ld = n;
int c_ld = n;
int a_offset = 0;
int b_offset = 0;
int c_offset = 0;
int layout = CLBlastLayout.CLBlastLayoutRowMajor;
int a_transpose = CLBlastTranspose.CLBlastTransposeNo;
int b_transpose = CLBlastTranspose.CLBlastTransposeNo;
float alpha = 1.0f; // non-zero, so that A * B actually contributes to C
float beta = 0.0f;
// Initialize OpenCL
defaultInitialization();
// Populate host matrices with some example data
float host_a[] = new float[m * k];
float host_b[] = new float[n * k];
float host_c[] = new float[m * n];
Random random = new Random(kSeed);
populateVector(host_a, random);
populateVector(host_b, random);
populateVector(host_c, random);
// Copy the matrices to the device
cl_mem device_a = copyToDevice(host_a);
cl_mem device_b = copyToDevice(host_b);
cl_mem device_c = copyToDevice(host_c);
System.out.printf(
"* Testing OverrideParameters for '%s'\n", routine_name);
// Loops over the valid combinations: run before and run afterwards
for (Map<String, Integer> override_setting : valid_settings)
{
// Call with the default parameters
int status_before = CLBlast.CLBlastSgemm(
layout, a_transpose, b_transpose, m,
n, k, alpha, device_a, a_offset,
a_ld, device_b, b_offset, b_ld, beta,
device_c, c_offset, c_ld, commandQueue, null);
CL.clFinish(commandQueue);
if (status_before != CLBlastStatusCode.CLBlastSuccess)
{
errors++;
continue;
}
// Overrides the parameters
int num_parameters = override_setting.size();
String parameters_names[] =
override_setting.keySet().toArray(new String[0]);
long[] parameters_values =
extractParameterValues(override_setting.values());
int status = CLBlast.CLBlastOverrideParameters(
device, kernel_name, precision, num_parameters,
parameters_names, parameters_values);
if (status != CLBlastStatusCode.CLBlastSuccess)
{
errors++;
continue;
}
// Call with the overridden parameters
int status_after = CLBlast.CLBlastSgemm(
layout, a_transpose, b_transpose, m,
n, k, alpha, device_a, a_offset,
a_ld, device_b, b_offset, b_ld, beta,
device_c, c_offset, c_ld, commandQueue, null);
CL.clFinish(commandQueue);
if (status_after != CLBlastStatusCode.CLBlastSuccess)
{
errors++;
continue;
}
passed++;
}
// Loops over the invalid combinations: run before and run afterwards
for (Map<String, Integer> override_setting : invalid_settings)
{
// Call with the default parameters
int status_before = CLBlast.CLBlastSgemm(
layout, a_transpose, b_transpose, m,
n, k, alpha, device_a, a_offset,
a_ld, device_b, b_offset, b_ld, beta,
device_c, c_offset, c_ld, commandQueue, null);
CL.clFinish(commandQueue);
if (status_before != CLBlastStatusCode.CLBlastSuccess)
{
errors++;
continue;
}
// Overrides the parameters
int num_parameters = override_setting.size();
String parameters_names[] =
override_setting.keySet().toArray(new String[0]);
long[] parameters_values =
extractParameterValues(override_setting.values());
int status = CLBlast.CLBlastOverrideParameters(
device, kernel_name, precision, num_parameters,
parameters_names, parameters_values);
if (status == CLBlastStatusCode.CLBlastSuccess) // expecting error
{
errors++;
continue;
}
// Call again (using the default parameters)
int status_after = CLBlast.CLBlastSgemm(
layout, a_transpose, b_transpose, m,
n, k, alpha, device_a, a_offset,
a_ld, device_b, b_offset, b_ld, beta,
device_c, c_offset, c_ld, commandQueue, null);
CL.clFinish(commandQueue);
if (status_after != CLBlastStatusCode.CLBlastSuccess)
{
errors++;
continue;
}
passed++;
}
// Print the statistics
System.out.printf(" %d test(s) passed\n", passed);
System.out.printf(" %d test(s) failed\n", errors);
System.out.printf("\n");
}
private static List<Map<String, Integer>> createValidSettings()
{
List<Map<String, Integer>> validSettings =
new ArrayList<Map<String, Integer>>();
Map<String, Integer> map = null;
map = new LinkedHashMap<String, Integer>();
map.put("KWG",16);
map.put("KWI",2);
map.put("MDIMA",4);
map.put("MDIMC",4);
map.put("MWG",16);
map.put("NDIMB",4);
map.put("NDIMC",4);
map.put("NWG",16);
map.put("SA",0);
map.put("SB",0);
map.put("STRM",0);
map.put("STRN",0);
map.put("VWM",1);
map.put("VWN",1);
validSettings.add(map);
map = new LinkedHashMap<String, Integer>();
map.put("KWG",32);
map.put("KWI",2);
map.put("MDIMA",4);
map.put("MDIMC",4);
map.put("MWG",32);
map.put("NDIMB",4);
map.put("NDIMC",4);
map.put("NWG",32);
map.put("SA",0);
map.put("SB",0);
map.put("STRM",0);
map.put("STRN",0);
map.put("VWM",1);
map.put("VWN",1);
validSettings.add(map);
return validSettings;
}
private static List<Map<String, Integer>> createInvalidSettings()
{
List<Map<String, Integer>> invalidSettings =
new ArrayList<Map<String, Integer>>();
Map<String, Integer> map = null;
map = new LinkedHashMap<String, Integer>();
map.put("KWI",2);
map.put("MDIMA",4);
map.put("MDIMC",4);
map.put("MWG",16);
map.put("NDIMB",4);
map.put("NDIMC",4);
map.put("NWG",16);
map.put("SA",0);
invalidSettings.add(map);
return invalidSettings;
}
private static long[] extractParameterValues(Collection<Integer> integers)
{
long result[] = new long[integers.size()];
int index = 0;
for (Integer integer : integers)
{
result[index] = integer;
index++;
}
return result;
}
private static void populateVector(float a[], Random random)
{
for (int i=0; i<a.length; i++)
{
a[i] = random.nextFloat();
}
}
private static cl_mem copyToDevice(float host[])
{
cl_mem device = clCreateBuffer(context, CL_MEM_READ_WRITE,
host.length * Sizeof.cl_float, null, null);
clEnqueueWriteBuffer(commandQueue, device, CL_TRUE, 0,
host.length * Sizeof.cl_float,
Pointer.to(host), 0, null, null);
return device;
}
/**
* Default OpenCL initialization of the device, context and command queue
*/
private static void defaultInitialization()
{
// Enable exceptions and subsequently omit error checks in this sample
CL.setExceptionsEnabled(true);
// Obtain the number of platforms
int numPlatformsArray[] = new int[1];
clGetPlatformIDs(0, null, numPlatformsArray);
int numPlatforms = numPlatformsArray[0];
// Obtain a platform ID
cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
clGetPlatformIDs(platforms.length, platforms, null);
cl_platform_id platform = platforms[platformIndex];
// Initialize the context properties
cl_context_properties contextProperties = new cl_context_properties();
contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
// Obtain the number of devices for the platform
int numDevicesArray[] = new int[1];
clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
int numDevices = numDevicesArray[0];
// Obtain a device ID
cl_device_id devices[] = new cl_device_id[numDevices];
clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
device = devices[deviceIndex];
// Create a context for the selected device
context = clCreateContext(
contextProperties, 1, new cl_device_id[]{device},
null, null, null);
String deviceName = getString(device, CL_DEVICE_NAME);
System.out.printf("CL_DEVICE_NAME: %s\n", deviceName);
// Create a command-queue
commandQueue = clCreateCommandQueue(
context, device, 0, null);
}
private static String getString(cl_device_id device, int paramName)
{
// Obtain the length of the string that will be queried
long size[] = new long[1];
clGetDeviceInfo(device, paramName, 0, null, size);
// Create a buffer of the appropriate size and fill it with the info
byte buffer[] = new byte[(int)size[0]];
clGetDeviceInfo(device, paramName, buffer.length,
Pointer.to(buffer), null);
// Create a string from the buffer (excluding the trailing \0 byte)
return new String(buffer, 0, buffer.length-1);
}
}
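The invalid settings in the test above differ from the valid ones only in that some parameter names are left out. A host-side check of this kind can be sketched as follows. It assumes that the 14 names used in createValidSettings are the complete parameter set of the Xgemm kernel, and that an incomplete set is the reason why the override is rejected. Both are assumptions, and the class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-check for override settings, not part of JOCLBlast. */
class OverrideSettingsCheck
{
    // Names taken from createValidSettings() in the test above.
    // Assumed (not verified) to be the full set for the Xgemm kernel.
    static final List<String> XGEMM_PARAMETERS = Arrays.asList(
        "KWG", "KWI", "MDIMA", "MDIMC", "MWG", "NDIMB", "NDIMC",
        "NWG", "SA", "SB", "STRM", "STRN", "VWM", "VWN");

    /** Returns the expected parameter names that the settings lack */
    static List<String> missingParameters(Map<String, Integer> settings)
    {
        List<String> missing = new ArrayList<String>();
        for (String name : XGEMM_PARAMETERS)
        {
            if (!settings.containsKey(name))
            {
                missing.add(name);
            }
        }
        return missing;
    }

    /** The same entries as in createInvalidSettings() in the test above */
    static Map<String, Integer> exampleInvalidSettings()
    {
        Map<String, Integer> map = new LinkedHashMap<String, Integer>();
        map.put("KWI", 2);
        map.put("MDIMA", 4);
        map.put("MDIMC", 4);
        map.put("MWG", 16);
        map.put("NDIMB", 4);
        map.put("NDIMC", 4);
        map.put("NWG", 16);
        map.put("SA", 0);
        return map;
    }

    public static void main(String[] args)
    {
        System.out.println(
            "Missing: " + missingParameters(exampleInvalidSettings()));
    }
}
```

Such a check only catches missing names. Whether a complete set of values is actually a valid combination for a device is still decided by CLBlast and the OpenCL compiler.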
Also, a small test/example for the CLBlastCaxpyBatched function:
package org.jocl.samples.blast;
import static org.jocl.CL.*;
import static org.jocl.blast.CLBlast.CLBlastCaxpyBatched;
import java.nio.FloatBuffer;
import java.util.Locale;
import org.jocl.*;
import org.jocl.blast.CLBlast;
/**
* An example for using the batched CAXPY function from CLBlast to compute
* Y = a * X + Y
* for several single-precision complex number vectors
*/
public class JOCLBlastCaxpyBatchedSample
{
private static cl_context context;
private static cl_command_queue commandQueue;
/**
* The entry point of this sample
*
* @param args Not used
*/
public static void main(String args[])
{
CL.setExceptionsEnabled(true);
CLBlast.setExceptionsEnabled(true);
defaultInitialization();
// Create the host input data. Each entry of these vectors consists
// of TWO values, which are the real- and imaginary part of the
// complex number
int numVectors = 3;
int vectorSize = 5;
// 3 vectors, each with 5 dimensions (*2, for real- and imaginary part)
float X[] =
{
1,1, 1,2, 1,3, 1,4, 1,5,
2,1, 2,2, 2,3, 2,4, 2,5,
3,1, 3,2, 3,3, 3,4, 3,5,
};
// 3 vectors, each with 5 dimensions (*2, for real- and imaginary part)
float Y[] =
{
4,1, 4,2, 4,3, 4,4, 4,5,
5,1, 5,2, 5,3, 5,4, 5,5,
6,1, 6,2, 6,3, 6,4, 6,5,
};
// Create the device input buffers
cl_mem memX = clCreateBuffer(context, CL_MEM_READ_ONLY,
vectorSize * numVectors * Sizeof.cl_float2, null, null);
cl_mem memY = clCreateBuffer(context, CL_MEM_READ_WRITE,
vectorSize * numVectors * Sizeof.cl_float2, null, null);
// Copy the host data to the device
clEnqueueWriteBuffer(commandQueue, memX, CL_TRUE, 0,
vectorSize * numVectors * Sizeof.cl_float2,
Pointer.to(X), 0, null, null);
clEnqueueWriteBuffer(commandQueue, memY, CL_TRUE, 0,
vectorSize * numVectors * Sizeof.cl_float2,
Pointer.to(Y), 0, null, null);
// 3 factors to be multiplied with X (*2, for real- and imaginary part)
float alphas[] = { 1,2, 2,3, 3,4 };
// Execute batched CAXPY: Y = alpha * X + Y
cl_event event = new cl_event();
CLBlastCaxpyBatched(vectorSize, alphas,
memX, new long[] { 0, 5, 10 }, 1,
memY, new long[] { 0, 5, 10 }, 1,
numVectors, commandQueue, event);
// Wait for the computation to be finished
clWaitForEvents( 1, new cl_event[] { event });
// Copy the result data back to the host
float resultY[] = new float[vectorSize * numVectors * 2];
clEnqueueReadBuffer(commandQueue, memY, CL_TRUE, 0,
vectorSize * numVectors * Sizeof.cl_float2,
Pointer.to(resultY), 0, null, null);
// Print the inputs and the result
System.out.println("a:");
printComplex2D(FloatBuffer.wrap(alphas), 1);
System.out.println("X:");
printComplex2D(FloatBuffer.wrap(X), vectorSize);
System.out.println("Y:");
printComplex2D(FloatBuffer.wrap(Y), vectorSize);
System.out.println("Result:");
printComplex2D(FloatBuffer.wrap(resultY), vectorSize);
// Clean up
clReleaseMemObject(memX);
clReleaseMemObject(memY);
clReleaseCommandQueue(commandQueue);
clReleaseContext(context);
}
/**
* Default OpenCL initialization of the context and command queue
*/
private static void defaultInitialization()
{
// The platform, device type and device number
// that will be used
final int platformIndex = 0;
final long deviceType = CL_DEVICE_TYPE_ALL;
final int deviceIndex = 0;
// Enable exceptions and subsequently omit error checks in this sample
CL.setExceptionsEnabled(true);
// Obtain the number of platforms
int numPlatformsArray[] = new int[1];
clGetPlatformIDs(0, null, numPlatformsArray);
int numPlatforms = numPlatformsArray[0];
// Obtain a platform ID
cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
clGetPlatformIDs(platforms.length, platforms, null);
cl_platform_id platform = platforms[platformIndex];
// Initialize the context properties
cl_context_properties contextProperties = new cl_context_properties();
contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);
// Obtain the number of devices for the platform
int numDevicesArray[] = new int[1];
clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
int numDevices = numDevicesArray[0];
// Obtain a device ID
cl_device_id devices[] = new cl_device_id[numDevices];
clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
cl_device_id device = devices[deviceIndex];
// Create a context for the selected device
context = clCreateContext(
contextProperties, 1, new cl_device_id[]{device},
null, null, null);
String deviceName = getString(device, CL_DEVICE_NAME);
System.out.printf("CL_DEVICE_NAME: %s\n", deviceName);
// Create a command-queue
commandQueue = clCreateCommandQueue(
context, device, 0, null);
}
/**
* Print the given buffer as a matrix with the given number of columns.
* This assumes that the elements of these buffers are complex
* numbers, consisting of a real- and an imaginary part.
*
* @param data The buffer
* @param columns The number of columns
*/
private static void printComplex2D(FloatBuffer data, int columns)
{
StringBuffer sb = new StringBuffer();
for (int i=0; i<data.capacity() / 2; i++)
{
sb.append(String.format(Locale.ENGLISH, "(%5.1f, %5.1fi) ",
data.get(i * 2 + 0), data.get(i * 2 + 1)));
if (((i + 1) % columns) == 0)
{
sb.append("\n");
}
}
System.out.print(sb.toString());
}
private static String getString(cl_device_id device, int paramName)
{
// Obtain the length of the string that will be queried
long size[] = new long[1];
clGetDeviceInfo(device, paramName, 0, null, size);
// Create a buffer of the appropriate size and fill it with the info
byte buffer[] = new byte[(int)size[0]];
clGetDeviceInfo(device, paramName, buffer.length,
Pointer.to(buffer), null);
// Create a string from the buffer (excluding the trailing \0 byte)
return new String(buffer, 0, buffer.length-1);
}
}
Both seem to work well. I'll still have to dive deeper into what OverrideParameters actually does to be sure that it has the intended effect, but I received some error messages from the OpenCL compiler when I called it with wrong parameters, so it at least does have an effect ;-)
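To cross-check the output of the batched CAXPY sample, the same computation can be done on the CPU. This is a sketch of a reference implementation, assuming that the offsets are counted in complex elements and that both increments are 1, as in the sample. The class and method names are hypothetical.

```java
/** Hypothetical CPU reference for the batched CAXPY sample. */
class CaxpyBatchedReference
{
    /**
     * Computes Y = alpha * X + Y for each batch, where all arrays store
     * complex numbers as (real, imaginary) pairs of floats.
     *
     * @param n The number of complex elements per vector
     * @param alphas One complex factor per batch
     * @param x The input vectors
     * @param xOffsets The start of each batch in x, in complex elements
     * @param y The vectors that are added
     * @param yOffsets The start of each batch in y, in complex elements
     * @return A new array containing the result
     */
    static float[] caxpyBatched(int n, float[] alphas,
        float[] x, int[] xOffsets, float[] y, int[] yOffsets)
    {
        float[] result = y.clone();
        for (int b = 0; b < xOffsets.length; b++)
        {
            float alphaRe = alphas[2 * b];
            float alphaIm = alphas[2 * b + 1];
            for (int i = 0; i < n; i++)
            {
                int xi = 2 * (xOffsets[b] + i);
                int yi = 2 * (yOffsets[b] + i);
                float xRe = x[xi];
                float xIm = x[xi + 1];
                // Complex multiplication alpha * x, added to y
                result[yi] += alphaRe * xRe - alphaIm * xIm;
                result[yi + 1] += alphaRe * xIm + alphaIm * xRe;
            }
        }
        return result;
    }

    /** Runs the reference on the same input data as the sample */
    static float[] runExample()
    {
        float[] x =
        {
            1,1, 1,2, 1,3, 1,4, 1,5,
            2,1, 2,2, 2,3, 2,4, 2,5,
            3,1, 3,2, 3,3, 3,4, 3,5,
        };
        float[] y =
        {
            4,1, 4,2, 4,3, 4,4, 4,5,
            5,1, 5,2, 5,3, 5,4, 5,5,
            6,1, 6,2, 6,3, 6,4, 6,5,
        };
        float[] alphas = { 1,2, 2,3, 3,4 };
        int[] offsets = { 0, 5, 10 };
        return caxpyBatched(5, alphas, x, offsets, y, offsets);
    }

    public static void main(String[] args)
    {
        float[] result = runExample();
        for (int i = 0; i < result.length / 2; i++)
        {
            System.out.printf("(%5.1f, %5.1fi) ",
                result[2 * i], result[2 * i + 1]);
            if ((i + 1) % 5 == 0)
            {
                System.out.println();
            }
        }
    }
}
```

For the first element this gives (1 + 2i) * (1 + 1i) + (4 + 1i) = 3 + 4i, which can be compared against the first entry that the sample prints under "Result:".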
I still have to create a GitHub repo for all the JOCL samples, so that I can finally summarize the examples from http://jocl.org/samples/samples.html and the ones that are posted elsewhere (in the forum and here) in one place....
@gpu @amherag Here is the linux build for 0.11.0. Everything went smoothly. jocl-blast-0.11.0-SNAPSHOT.zip
(EDIT: Writing this overlapped with the comment at https://github.com/gpu/JOCLBlast/issues/9#issuecomment-303222830 )
I have done a small update for https://github.com/gpu/JOCLBlast/issues/9#issuecomment-303222495
Although technically, it should not change anything for the linux version, it might be clearer if the linux version would also be compiled based on this state. (The change might still cause issues on Linux - although, of course, it should not, but just to be sure...)
Here it is :)
And the linux build is also ready: jocl-blast-0.11.0-SNAPSHOT-22-5-2017.zip
You're great! I'll build the Maven package ASAP (maybe tomorrow, but most likely not later than Thursday)
Thanks again to @amherag and @blueberry (and @CNugteren , for making all this possible in the first place ;-) )
The release will soon be available as
<dependency>
    <groupId>org.jocl</groupId>
    <artifactId>jocl-blast</artifactId>
    <version>0.11.0</version>
</dependency>
@blueberry @gpu Done!
@amherag Hi Amaury. I'm afraid that we first have to wait for @gpu to update JOCLBlast to the newest CLBlast 1.0 :)
@blueberry Yeah, I was wondering why the versions didn't match. I was going to update my comment, but I decided to wait and see what you or @gpu were going to tell me :P
Thanks for the heads-up. Apart from the *AMIN functions, there seem to be no changes in the API. I'll try to schedule the update ASAP (I'm a bit short on time this week, but will see what I can do)
Thank you, @gpu
Thanks again everyone! There was a bug fixed just after the release though, so I'll make a 1.0.1 release soon after (next week after everything is properly checked this time). Perhaps you should wait for that?
@CNugteren @gpu I'd prefer to wait for the proper release, as I am in no hurry. Thanks everyone!
Yes, that sounds like a plan :-)
New 1.0.1 release is now made, sorry for any inconvenience. Greatly appreciate your effort with JOCLBlast!
These efforts are nothing compared to the efforts that went into CLBlast itself 👍
(I'll do the update on Sunday/Monday and drop a note here)
Although it's already Tuesday now, here is the tag for the 1.0.1 release:
https://github.com/gpu/JOCLBlast/releases/tag/1.0.1-RC00
@blueberry and @amherag Once the natives for JOCLBlast and CLBlast are available, I'll publish the Maven release.
(BTW: This issue is already rather long. I'd probably close this after the release, so that we can use dedicated issues for the subsequent releases)
I will be able to build it and test it only in a few weeks. I hope that is OK. Sorry.
OK for me. Maybe that's a chance for me to try and build this on a VirtualBox VM. This should work, but not being able to really test the resulting library would cause me to hesitate publishing it.
(Maybe I can build it on a VM, and you can try out whether the resulting lib works on a real machine. If it does, I could build the linux libs myself in the future)
Just testing whether it works wouldn't be that time-consuming for me, but the thing is that Cedric committed new tuning results for the GPU that I use from another user that tuned it with a newer GPU. However, that user was getting some results that were suspicious to me, so I need to investigate this and make some measurements to see whether these new changes do not introduce some noticeable performance regressions on my hardware (R9 290X)...
@gpu Hi Marco. I've finally come around to building JOCLBlast 1.0.1 for Linux. Sorry for the delay.
@amherag reminder :)
@blueberry Thanks for the reminder :D
Since CLBlast 0.7.0 is out, maybe we can prepare the release 0.7.0 of JOCLBlast (and also RC01 of JOCL)? We have people that can build for all 3 major operating systems...