microsoft / OpenCLOn12

The OpenCL-on-D3D12 mapping layer
MIT License
104 stars 13 forks source link

Result is not correct with Octave-Ocl package. #30

Closed tangjinchuan closed 1 year ago

tangjinchuan commented 2 years ago

Dear MS OpenCL team,

I tried to use your OpenCL driver on DX 12 devices with Octave 6.3 + Octave-Ocl OpenCL package (An opensource alternative to Matlab's GPU parallel toolbox, such as dropin replacement for gpuArray and many other functions.https://gnu-octave.github.io/packages/ocl) However, it could not pass the test with the help of cmd :ocl_tests or cat function offered by Octave-Ocl.

After install the ocl package with pkg install from https://gnu-octave.github.io/packages/ocl. You can use

  1. pkg load ocl % to load ocl module on Octave
  2. ocl_context('get_resources') % to find all the available gpu runtime
  3. ocl_context('device_selection','GPU2') % where 2 can be changed manually to match the number of your GPU runtime read from get_resources. Start from 0. And should not count the number of CPU devices.
  4. aa= ocl_ones(3,5,'single') % create 3x5 gpuArray with ones
  5. bb = ocl_cat(1,aa,aa):% cat the same array vertically
  6. gather(bb); % get the answer from GPU memory to host memory. This will give abnormal results with your runtime rather than the runtime from Nvidia or Intel.

For example :

pkg load ocl

ocl_context('device_selection','GPU2') aa= ocl_ones(3,5,'single')

aa = 2-dimensional OCL array (3x5) of class single (float)

ocl_cat(1,aa,aa)

ans = 2-dimensional OCL array (6x5) of class single (float)

gather(ans)

ans =

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The result is not right.

Best wishes, Jinchuan Tang

tangjinchuan commented 2 years ago

Dear MS team, I would like to give my observe about this problem. This problem still exists with latest version 1.2112.2.0 as published 16 days ago.

One important thing I discovered was that If a kernel have been enqueued twice or more (e.g., for loop in cat kernel submission), and if none of it is being executed at the device side, then this OpenCLOn12 runtime will ignore the previously queued kernel but only leave the latest one who has the same kernel name. This causes the incorrect results while doing examples like ocl_cat(1,a,a,a).

To fix this, we have to add clFlush between two kernel submissions if their names are the same.

For more info: https://sourceforge.net/p/octave-ocl/discussion/general/thread/b47bc925d3/

tangjinchuan commented 2 years ago

Hi, please see the attached test case to reproduce the results. I create a simple opencl program with codeblocks to simulate a catting-two-array behavior. Array 'a' has a length of 3 while array 'c' has alength of 6. The former is initialized to 1 while the latter is initialized to 0. Then, the program issues two kernels by a loop to concat two array 'a' into one array c. Then the program sums the whole array c, which should produce 6. 6 is obtained by using intel opencl gpu driver, while 3 is obtained with openclondx12. Note: to select the correct gpu device (like openclondx12 driver), you need to change line 83 to the correct platform id: err = clGetDeviceIDs(cpPlatform[0], CL_DEVICE_TYPE_GPU, 1, &device_id, NULL); I tried other kernels like adding several times they issued with for loop and have no problem. Thank you very much! Best wishes, Jinchuan Tang ts.zip ts.zip

jenatali commented 1 year ago

Sorry for the long delay on this. This was a good bug. It was broken by https://github.com/microsoft/OpenCLOn12/commit/56608692b1ff75dde5b5345762d9eeeaac24597d, which was in turn a fix for a Photoshop effect which had incorrect rendering. By moving the constant buffer data upload from the app thread kernel enqueue, to the worker thread submitting to D3D, I was no longer capturing the app's private memory-space kernel args at the right time, so both kernels ended up with the last one's args.