Example from readme has zero output on GTX 1060 Max-Q

KarelPeeters commented 3 years ago

I'm getting started with OpenCL, and I'm trying out this crate. I started with the example in the readme. I had to add some extra code at the top to pick a device and define svm_capability, I hope I did that correctly. I also print a few bits of debug info about the device, in the end this is the full code that I run:

https://gist.github.com/KarelPeeters/74a864d42af4b1be5f8d8b4a5aed675d

The output of this code is:

Found platform NVIDIA CUDA
Found platform Intel(R) OpenCL HD Graphics
Using device "NVIDIA GeForce GTX 1060 with Max-Q Design"
  version: OpenCL 3.0 CUDA
  c version: OpenCL C 1.2 
  svm_cap: 1
sum results: [0, 0, 0, 0, 0, 0, 0, 0]

When I change the platform to Intel, I get this:

Found platform NVIDIA CUDA
Found platform Intel(R) OpenCL HD Graphics
Using device "Intel(R) HD Graphics 630"
  version: OpenCL 3.0 NEO 
  c_version: OpenCL C 3.0 
  svm_cap: 11
sum results: [3, 5, 10, 19, 26, 27, 31, 33]

Clearly there are a bunch of differences between both platforms, and unfortunately the result is wrong on my actual GPU. Are there other things I need to change in the example to get it to work correctly? I tried changing the kernel to just fill the output with 1s, but the output was still all zeros, so I assume there is some SVM issue.

As I said I'm pretty new to OpenCL, so I really have no idea where to start. Feel free to ask for extra information if I forgot anything!

kenba commented 3 years ago

Hi Karel, I'm sorry to say that it looks like your Nvidia graphics card does not support SVM which is a key OpenCL 2.0 feature.

Nvidia's support for OpenCL has always been poor, as this 5 year old article shows: https://forums.developer.nvidia.com/t/opencl-2-x-support-plans/44215. Many people believe that it is a deliberate strategy by Nvidia to encourage developers to use their (proprietary) CUDA platform.

The OpenCL 3.0 standard allows device manufacturers like Nvidia to claim support for OpenCL 3.0, without providing OpenCL 2.0 features like SVM. Maybe Khronos having an Nvidia employee as it's president has something to do with it. AMD and Intel have much better support for OpenCL, as the example working on your Intel integrated graphics device shows.

In short, it is highly unlikely that you will be able to use SVM on your Nvidia graphics card anytime soon. Therefore, I recommend that you follow the test_opencl_1_2_example in the integration_test.rs file, which uses OpenCL 1.2 buffers instead of SVM to work on OpenCL 1.2 hardware like Nvidia.

KarelPeeters commented 3 years ago

Ah that's unfortunate, and indeed test_opencl_1_2_example does seem to work, I'll try to continue with the older OpenCL features then. I'm switching from CUDA to OpenCL specifically to avoid the lock-in, so at least I'm doing my part 🙂 .

It's strange that this check inside SvmVec passes, even though actually using it doesn't work. Is that still a bug in this crate or should I check this stuff myself before trying to instantiate an SvmVec?

https://github.com/kenba/opencl3/blob/71c8b8594b49f2bf43d88b1f1b4fe2552bd52273/src/svm.rs#L45-L49

kenba commented 3 years ago

I'm sorry Karel, but my previous answer was incorrect: your Nvidia graphics card does support SVM.
It supports CL_DEVICE_SVM_COARSE_GRAIN_BUFFER, see CL_DEVICE_SVM_CAPABILITIES.

Therefore, you can still use an SvmVec, however you must also use clEnqueueMapBuffer and clEnqueueUnmapMemObject as described in the OpenCL 3.0 API above.

The example in the readme is for CL_DEVICE_SVM_FINE_GRAIN_BUFFER (as supported by your Intel graphics device) which does not require calls to clEnqueueMapBuffer and clEnqueueUnmapMemObject.

Note: you can call clEnqueueMapBuffer and clEnqueueUnmapMemObject via the CommandQueue methods: enqueue_map_buffer and enqueue_unmap_mem_object respectively.

Finally, the check inside SvmVec is correct. However, you should always "check this stuff yourself before trying to instantiate an SvmVec", because otherwise the assert will panic and crash your program! ;)

KarelPeeters commented 3 years ago

Are you sure it's possible to use enqueue_map_buffer for this? It takes a &Buffer<T>, which I don't have. I tried it by manually calling cl3::command_queue::enqueue_map_buffer, which looks like this:

// Copy input data into an OpenCL SVM vector
let mut test_values = SvmVec::<cl_int>::with_capacity(&context, svm_capability, ARRAY_SIZE);

let mut buffer_ptr = std::ptr::null_mut();
let status = cl3::command_queue::enqueue_map_buffer(
    queue.get(),
    test_values.as_ptr() as cl_mem,
    CL_TRUE,
    CL_MAP_WRITE,
    0,
    test_values.len() * std::mem::size_of::<cl_int>(),
    &mut buffer_ptr,
    0,
    std::ptr::null(),
)?;
println!("{:?}", buffer_ptr);
println!("{:?}", status);

for &val in value_array.iter() {
    test_values.push(val);
}

//TODO unmap again

But that yields ClError(-34), ie. CL_INVALID_CONTEXT, I assume because the Svm is not actually a buffer. It's not super important since I can just use actual Buffers like in the other example.

kenba commented 3 years ago

Sorry Karel but I'm not sure that it's possible to use enqueue_map_buffer. I took that from the OpenCL 3.0 API, which goes into more detail about it here: SVM sharing granularity.

Unfortunately, I've not used coarse grained SVM and I think that you are correct that clEnqueueMapBuffer does not work because SvmVec is not actually an SVM buffer. You may have to use the value returned from clSVMAlloc directly, instead of in the wrapper provided by SvmVec.

The main benefits of SVM come with fine grained sharing where you don't need to call the map and unmap functions.

KarelPeeters commented 3 years ago

I'll just continue with Buffers for now, it's not that big of a deal.

Is there any situation it which it makes sense to instantiate an SvmVec on a system that does not support fine-grained SVM? Maybe the CL_DEVICE_SVM_COARSE_GRAIN_BUFFER | part in the assert I linked above can just be removed?

kenba commented 3 years ago

@KarelPeeters yes I'm sure that there are many situations where you may want to instantiate an SvmVec on a system that does not support fine-grained SVM.

Please see #33 and test_opencl_svm_example in integration_test.rs for an example on how to use coarse grained SVM in your system. I shall update the documentation in due course.

kenba commented 3 years ago

@KarelPeeters I've just incorporated changes to fix this issue in version 0.5 of the library on crates.io.
If you are satisfied with the changes then please close this issue.

KarelPeeters commented 3 years ago

I think there's a small issue in the readme now, ExecuteKernel::new takes a &Kernel and the example is missing the &.

Otherwise everything looks great to me now, thanks a lot for all the help and for improvements!

kenba commented 3 years ago

You are correct, that "ExecuteKernel::new takes a &Kernel and the example is missing the &".
However, all the integration tests are missing the & and compile and run fine without any clippy warnings, so it's not a problem.

KarelPeeters commented 3 years ago

Hmm that's strange I only made this issue because when running the example in the readme myself I got this error:

error[E0308]: mismatched types
   --> main.rs:136:32
    |
136 |             ExecuteKernel::new(kernel)
    |                                ^^^^^^
    |                                |
    |                                expected `&Kernel`, found struct `Kernel`
    |                                help: consider borrowing here: `&kernel`

And it looks like the integration tests do use &kernel, see eg.

https://github.com/kenba/opencl3/blob/9bb74b6d271de490a1919f7b988dda613928844d/tests/integration_test.rs#L116

kenba commented 3 years ago

Sorry Karel, I was wrong. All the Kernel references in the opencl2_kernel_test.rs integration test are missing the & because they are references. You are quite right, all the Kernel references in the integration_test.rs file have the &.

PR #36 will fix this issue as well as making the example a working program.

KarelPeeters commented 3 years ago

Great, then I think this issue is done.

kenba / opencl3

Example from readme has zero output on GTX 1060 Max-Q #32