lukeiwanski / tensorflow

OpenCL support for TensorFlow via SYCL
Apache License 2.0

Basic GPU test failed in Ubuntu 16.04 [AMD memory caps] #167

Closed: gauthampughazhendhi closed this issue 5 years ago

gauthampughazhendhi commented 6 years ago

When I run the following command

bazel test -c opt --config=sycl --test_output=all //tensorflow/python/kernel_tests:basic_gpu_test

the following log is displayed:

WARNING: /home/gautham/tensorflow/tensorflow/core/BUILD:1780:1: in includes attribute of cc_library rule //tensorflow/core:framework_headers_lib: '../../external/nsync/public' resolves to 'external/nsync/public' not below the relative path of its package 'tensorflow/core'. This will be an error in the future. Since this rule was created by the macro 'cc_header_only_library', the error might have been caused by the macro implementation in /home/gautham/tensorflow/tensorflow/tensorflow.bzl:1029:30
INFO: Analysed target //tensorflow/python/kernel_tests:basic_gpu_test (0 packages loaded).
INFO: Found 1 test target...
FAIL: //tensorflow/python/kernel_tests:basic_gpu_test (see /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-py3-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log)
INFO: From Testing //tensorflow/python/kernel_tests:basic_gpu_test:
==================== Test output for //tensorflow/python/kernel_tests:basic_gpu_test:
2017-10-10 13:54:12.295721: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Aborted (core dumped)
================================================================================
Target //tensorflow/python/kernel_tests:basic_gpu_test up-to-date:
  bazel-bin/tensorflow/python/kernel_tests/basic_gpu_test
INFO: Elapsed time: 15.027s, Critical Path: 3.43s
INFO: Build completed, 1 test FAILED, 2 total actions
//tensorflow/python/kernel_tests:basic_gpu_test                          FAILED in 2.5s
  /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-py3-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log

Executed 1 out of 1 test: 1 fails locally.

And my computecpp_info output is as follows:


ComputeCpp Info (CE 0.3.2)

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

Device Info:

Discovered 1 devices matching:
  platform : <any>
  device type : <any>

Device 0:

  Device is supported : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME : Capeverde
  CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION : 2442.7
  CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult this website for known issues: https://computecpp.codeplay.com/releases/v0.3.2/platform-support-notes


I don't know where it is going wrong.

lukeiwanski commented 6 years ago

Thanks for reporting. We will have a look and get back to you ASAP.

gauthampughazhendhi commented 6 years ago

The output of clinfo is as follows:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2442.7)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon HD 8800M Series
Device Topology: PCI[ B#3, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 400Mhz
Address bits: 32
Max memory allocation: 1409077248
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
  Denorms: No
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 2140639232
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
  Coarse grain buffer: No
  Fine grain buffer: No
  Fine grain system: No
  Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: No
Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
Queue on Device properties:
  Out-of-Order: No
  Profiling : No
Platform ID: 0x7f24454a7478
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2442.7
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2442.7)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

DuncanMcBain commented 6 years ago

Hi @GauthamPughaz, I can see where in the code this error is being thrown from, but I don't really understand why. If you don't mind, it might be useful to try the simpler tests from the SDK we maintain: https://github.com/codeplaysoftware/computecpp-sdk Hopefully these tests will be simple enough that we can debug exactly what's going on! They're very easy to set up - details are in the repo, but running the following from the root directory of the SDK should work, provided you change the install location to wherever you've saved the ComputeCpp files (the same path you specify when configuring TensorFlow to use OpenCL):

mkdir build && cd build
cmake ../samples -DCOMPUTECPP_PACKAGE_ROOT_DIR=/path/to/computecpp/install
make -j4
ctest

It'd be really helpful to us if you did have a look at this, though I appreciate it's a pain when things don't "just work". Many thanks, Duncan.

gauthampughazhendhi commented 6 years ago

Yeah, I have done what you said. The log for that is:

gautham@gautham-dell:~/computecpp-sdk/build$ ctest
Test project /home/gautham/computecpp-sdk/build
      Start  1: scan
 1/21 Test  #1: scan .............................Exception: Other  0.43 sec
      Start  2: images
 2/21 Test  #2: images ...........................Failed    0.04 sec
      Start  3: opencl_c_interop
 3/21 Test  #3: opencl_c_interop .................Exception: Other  0.12 sec
      Start  4: example_vptr
 4/21 Test  #4: example_vptr .....................Exception: Other  0.12 sec
      Start  5: simple_local_barrier
 5/21 Test  #5: simple_local_barrier .............Exception: Other  0.12 sec
      Start  6: simple_example_of_vectors
 6/21 Test  #6: simple_example_of_vectors ........Exception: Other  0.12 sec
      Start  7: example_sycl_application
 7/21 Test  #7: example_sycl_application .........Exception: Other  0.12 sec
      Start  8: using_functors
 8/21 Test  #8: using_functors ...................Failed    0.01 sec
      Start  9: parallel_for
 9/21 Test  #9: parallel_for .....................Failed    0.01 sec
      Start 10: hello_world
10/21 Test #10: hello_world ......................Exception: Other  0.12 sec
      Start 11: accessors
11/21 Test #11: accessors ........................Failed    0.01 sec
      Start 12: simple_private_memory
12/21 Test #12: simple_private_memory ............Exception: Other  0.12 sec
      Start 13: matrix_multiply_omp
13/21 Test #13: matrix_multiply_omp ..............   Passed    0.00 sec
      Start 14: matrix_multiply_sycl
14/21 Test #14: matrix_multiply_sycl .............Exception: Other  0.12 sec
      Start 15: custom_device_selector
15/21 Test #15: custom_device_selector ...........Exception: Other  0.12 sec
      Start 16: template_functor
16/21 Test #16: template_functor .................Exception: Other  0.12 sec
      Start 17: smart_pointer
17/21 Test #17: smart_pointer ....................Exception: Other  0.12 sec
      Start 18: gaussian_blur
18/21 Test #18: gaussian_blur ....................Exception: Other  0.14 sec
      Start 19: reduction
19/21 Test #19: reduction ........................Exception: Other  0.12 sec
      Start 20: simple_vector_add
20/21 Test #20: simple_vector_add ................Exception: Other  0.12 sec
      Start 21: async_handler
21/21 Test #21: async_handler ....................Exception: Other  0.12 sec

5% tests passed, 20 tests failed out of 21

Total Test time (real) = 2.34 sec

The following tests FAILED:
      1 - scan (OTHER_FAULT)
      2 - images (Failed)
      3 - opencl_c_interop (OTHER_FAULT)
      4 - example_vptr (OTHER_FAULT)
      5 - simple_local_barrier (OTHER_FAULT)
      6 - simple_example_of_vectors (OTHER_FAULT)
      7 - example_sycl_application (OTHER_FAULT)
      8 - using_functors (Failed)
      9 - parallel_for (Failed)
     10 - hello_world (OTHER_FAULT)
     11 - accessors (Failed)
     12 - simple_private_memory (OTHER_FAULT)
     14 - matrix_multiply_sycl (OTHER_FAULT)
     15 - custom_device_selector (OTHER_FAULT)
     16 - template_functor (OTHER_FAULT)
     17 - smart_pointer (OTHER_FAULT)
     18 - gaussian_blur (OTHER_FAULT)
     19 - reduction (OTHER_FAULT)
     20 - simple_vector_add (OTHER_FAULT)
     21 - async_handler (OTHER_FAULT)
Errors while running CTest

DuncanMcBain commented 6 years ago

Hi @GauthamPughaz, thanks for that. It looks like nothing that uses SYCL is working, which is quite confusing! I'd really appreciate it if you could try the following code to see what the underlying error is:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  try {
    cl::sycl::queue q;
  } catch (cl::sycl::cl_exception& e) {
    std::cout << e.what() << "\n";
    std::cout << "Error: " << e.get_cl_code() << ": " << e.get_cl_error_message();
  }
  return 0;
}

This should print out exactly what error code is being returned by the clCreateCommandQueue() function. It is very rare for this to fail, so I'd be very interested to know what's going on!

Many thanks, Duncan.

EDIT: I've added the missing SYCL header, because I foolishly forgot it!

gauthampughazhendhi commented 6 years ago

Sorry for this question, but how and where do I run this, since it requires the sycl.hpp header file?

DuncanMcBain commented 6 years ago

Whoops, you're right! I think the easiest thing would be to edit one of the samples in the SDK, such that the body of the code is replaced with that - the hello world sample is very simple. I've included a patch that you could apply to the SDK. Simply copy the contents to a file in the root of the directory, then:

git apply testing-queues.patch
cd build
make hello_world
hello_world/hello_world

I've actually compiled and run it this time, though it passes without error for me! Sorry for getting that wrong earlier.

diff --git a/samples/hello_world/hello_world.cpp b/samples/hello_world/hello_world.cpp
index 2f65f49..20e8fa8 100644
--- a/samples/hello_world/hello_world.cpp
+++ b/samples/hello_world/hello_world.cpp
@@ -34,42 +34,11 @@
  * (as determined by the SYCL implementation) whose only function is to
  * output the canonical "hello world" string. */
 int main() {
-  /* Selectors determine which device kernels will be dispatched to.
-   * Try using a host_selector, too! */
-  cl::sycl::default_selector selector;
-
-  /* Queues are used to enqueue work.
-   * In this case we construct the queue using the selector. Users can create
-   * their own selectors to choose whatever kind of device they might need. */
-  cl::sycl::queue myQueue(selector);
-  std::cout << "Running on "
-            << myQueue.get_device().get_info<cl::sycl::info::device::name>()
-            << "\n";
-
-  /* C++ 11 lambda functions can be used to submit work to the queue.
-   * They set up data transfers, kernel compilation and the actual
-   * kernel execution. This submission has no data, only a "stream" object.
-   * Useful in debugging, it is a lot like an std::ostream. The handler
-   * object is used to control the scope certain operations can be done. */
-  myQueue.submit([&](cl::sycl::handler& cgh) {
-    /* The stream object allows output to be generated from the kernel. It
-     * takes three parameters in its constructor. The first is the maximum
-     * output size in bytes, the second is how large [in bytes] the total
-     * output of any single << chain can be, and the third is the cgh,
-     * ensuring it can only be constructed inside a submit() call. */
-    cl::sycl::stream os(1024, 80, cgh);
-
-    /* single_task is the simplest way of executing a kernel on a
-     * SYCL device. A single thread executes the code inside the kernel
-     * lambda. The template parameter needs to be a unique name that
-     * the runtime can use to identify the kernel (since lambdas have
-     * no accessible name). */
-    cgh.single_task<class hello_world>([=]() {
-      /* We use the stream operator on the stream object we created above
-       * to print to stdout from the device. */
-      os << "Hello, World!\n";
-    });
-  });
-
+  try {
+    cl::sycl::queue q;
+  } catch(cl::sycl::cl_exception& e) {
+    std::cout << e.what() << "\n";
+    std::cout << "Error: " << e.get_cl_code() << ": " << e.get_cl_error_message();
+  }
   return 0;
 }

gauthampughazhendhi commented 6 years ago

Got it, the log is shown below.

terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0408] Error querying the number of OpenCL platforms in the system (Cannot query number of platforms on second attempt)
Aborted (core dumped)

gauthampughazhendhi commented 6 years ago

Is this sufficient?

DuncanMcBain commented 6 years ago

Hi, sorry for not responding. This is a really bizarre error which doesn't make a lot of sense to me; there is barely any ComputeCpp code running at the point where the program crashes. Are you running in some kind of virtual environment, like a Docker container, or something else maybe?

gauthampughazhendhi commented 6 years ago

No, nothing like that. Is there any other way to debug this?

gauthampughazhendhi commented 6 years ago

My /usr/local/computecpp/ directory contains the following folders: bin, doc, include, lib. Is this fine?

gauthampughazhendhi commented 6 years ago

I repeated the process from step one and checked it with a test; it failed again. Here is the log for that failure of the basic GPU test:

WARNING: /home/gautham/tensorflow/tensorflow/core/BUILD:1782:1: in includes attribute of cc_library rule //tensorflow/core:framework_headers_lib: '../../external/nsync/public' resolves to 'external/nsync/public' not below the relative path of its package 'tensorflow/core'. This will be an error in the future. Since this rule was created by the macro 'cc_header_only_library', the error might have been caused by the macro implementation in /home/gautham/tensorflow/tensorflow/tensorflow.bzl:1048:30
INFO: Analysed target //tensorflow/python/kernel_tests:basic_gpu_test (2 packages loaded).
INFO: Found 1 test target...
Building: no action running
FAIL: //tensorflow/python/kernel_tests:basic_gpu_test (see /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log)
INFO: From Testing //tensorflow/python/kernel_tests:basic_gpu_test:
==================== Test output for //tensorflow/python/kernel_tests:basic_gpu_test:
2017-10-13 05:57:32.089958: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Aborted (core dumped)
================================================================================
Target //tensorflow/python/kernel_tests:basic_gpu_test up-to-date:
  bazel-bin/tensorflow/python/kernel_tests/basic_gpu_test
INFO: Elapsed time: 86.771s, Critical Path: 5.92s
INFO: Build completed, 1 test FAILED, 5 total actions
//tensorflow/python/kernel_tests:basic_gpu_test                          FAILED in 3.5s
  /home/gautham/.cache/bazel/_bazel_gautham/f3cb0296cc06462eaad2b67eee0ab414/execroot/org_tensorflow/bazel-out/local_linux-opt/testlogs/tensorflow/python/kernel_tests/basic_gpu_test/test.log

Executed 1 out of 1 test: 1 fails locally.

DuncanMcBain commented 6 years ago

I'm sorry I'm being slow on this; it's just that this is rather tricky to debug. What this is saying is that the queue-creation part of ComputeCpp is failing, but this is a very simple, basic operation. Could you try this modified patch on the SDK to see if there's a difference in output? After this, the only thing I can think to try is plain OpenCL code.

The difference here is that I've asked it to use a default selector. The behaviour ideally should be the same but it looks like it might be slightly different - I guess we can see what happens. Thanks for sticking with this!

diff --git a/samples/hello_world/hello_world.cpp b/samples/hello_world/hello_world.cpp
index 2f65f49..20e8fa8 100644
--- a/samples/hello_world/hello_world.cpp
+++ b/samples/hello_world/hello_world.cpp
@@ -34,42 +34,12 @@
  * (as determined by the SYCL implementation) whose only function is to
  * output the canonical "hello world" string. */
 int main() {
-  /* Selectors determine which device kernels will be dispatched to.
-   * Try using a host_selector, too! */
-  cl::sycl::default_selector selector;
-
-  /* Queues are used to enqueue work.
-   * In this case we construct the queue using the selector. Users can create
-   * their own selectors to choose whatever kind of device they might need. */
-  cl::sycl::queue myQueue(selector);
-  std::cout << "Running on "
-            << myQueue.get_device().get_info<cl::sycl::info::device::name>()
-            << "\n";
-
-  /* C++ 11 lambda functions can be used to submit work to the queue.
-   * They set up data transfers, kernel compilation and the actual
-   * kernel execution. This submission has no data, only a "stream" object.
-   * Useful in debugging, it is a lot like an std::ostream. The handler
-   * object is used to control the scope certain operations can be done. */
-  myQueue.submit([&](cl::sycl::handler& cgh) {
-    /* The stream object allows output to be generated from the kernel. It
-     * takes three parameters in its constructor. The first is the maximum
-     * output size in bytes, the second is how large [in bytes] the total
-     * output of any single << chain can be, and the third is the cgh,
-     * ensuring it can only be constructed inside a submit() call. */
-    cl::sycl::stream os(1024, 80, cgh);
-
-    /* single_task is the simplest way of executing a kernel on a
-     * SYCL device. A single thread executes the code inside the kernel
-     * lambda. The template parameter needs to be a unique name that
-     * the runtime can use to identify the kernel (since lambdas have
-     * no accessible name). */
-    cgh.single_task<class hello_world>([=]() {
-      /* We use the stream operator on the stream object we created above
-       * to print to stdout from the device. */
-      os << "Hello, World!\n";
-    });
-  });
-
+  try {
+    cl::sycl::default_selector ds;
+    cl::sycl::queue q(ds);
+  } catch(cl::sycl::cl_exception& e) {
+    std::cout << e.what() << "\n";
+    std::cout << "Error: " << e.get_cl_code() << ": " << e.get_cl_error_message();
+  }
   return 0;
 }

gauthampughazhendhi commented 6 years ago

terminate called after throwing an instance of 'cl::sycl::cl_exception'
  what():  Error: [ComputeCpp:RT0408] Error querying the number of OpenCL platforms in the system (Cannot query number of platforms on second attempt)
Aborted (core dumped)

gauthampughazhendhi commented 6 years ago

While building in my second attempt, out of 300 tests only 47 passed.

DuncanMcBain commented 6 years ago

OK. I don't understand what's happening here. This is the function that's failing: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clGetPlatformIDs.html

I suppose I should explain a little. ComputeCpp discovers all the different OpenCL platforms available on the system through this function. If it doesn't work... there's nothing we can do. Are you sure the driver you have installed properly supports your hardware? I had a look at the available driver downloads for your hardware, and the only one available is the (now rather old) Crimson driver. AMDGPU-PRO supports some mobile chips (like the R9 M300 series, for example) but not the R9 M200 cards. ComputeCpp requires OpenCL 1.2 and OpenCL SPIR 1.2 to run - if your system can't support that, we can't run.

(This would explain why so many tests fail - those that attempt to instantiate any kind of device or queue will fail rapidly.)

gauthampughazhendhi commented 6 years ago

Ok, thank you so much. One final query: is the command export COMPUTE=:0 required for my setup, since it was listed for Ubuntu 14.04 and not for 16.04? Another thing: I just checked my system and found I have two GPU models on my system.

gautham@gautham-dell:~/computecpp-sdk/build$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Venus PRO [Radeon HD 8850M / R9 M265X] (rev ff)

Is that an issue?

DuncanMcBain commented 6 years ago

It's one GPU. The M265X is, to some greater or lesser extent, a rebranded 8850M, from what I can tell. export COMPUTE=:0 means "use the first device on the system". If you don't have any other devices, it shouldn't change anything. (@lukeiwanski might be able to add more on that point.)

lukeiwanski commented 6 years ago

@GauthamPughaz no, you don't need export COMPUTE=:0 - that was only relevant on 14.04.

gauthampughazhendhi commented 6 years ago

Ok, thank you.

davide-maestroni commented 6 years ago

Any update on this? I'm facing exactly the same issue with ComputeCpp 0.3.3 (tried both dev/eigen_mehdi and dev/amd_gpu branches).

The output from clinfo is:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2442.7)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon R9 (TM) M370X
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 800Mhz
Address bits: 32
Max memory allocation: 844744704
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
  Denorms: No
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 1366024192
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
  Coarse grain buffer: No
  Fine grain buffer: No
  Fine grain system: No
  Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: No
Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
Queue on Device properties:
  Out-of-Order: No
  Profiling : No
Platform ID: 0x7fe001fdd478
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2442.7
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2442.7)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

and the one from computecpp_info:


ComputeCpp Info (CE 0.3.3)

Toolchain information:

GLIBC version: 2.23
GLIBCXX: 20160609
This version of libstdc++ is supported.

Device Info:

Discovered 1 devices matching:
  platform : <any>
  device type : <any>

Device 0:

  Device is supported : UNTESTED - Vendor not tested on this OS
  CL_DEVICE_NAME : Capeverde
  CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION : 2442.7
  CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU

If you encounter problems when using any of these OpenCL devices, please consult this website for known issues: https://computecpp.codeplay.com/releases/v0.3.3/platform-support-notes


I've also run a simple test compiled from https://laanwj.github.io/assets/2016/05/06/opencl-ubuntu1604/devices.c and the output is:

1. Platform
   Profile: FULL_PROFILE
   Version: OpenCL 2.0 AMD-APP (2442.7)
   Name: AMD Accelerated Parallel Processing
   Vendor: Advanced Micro Devices, Inc.
   Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
2. Device: Capeverde
   1.1 Hardware version: OpenCL 1.2 AMD-APP (2442.7)
   1.2 Software version: 2442.7
   1.3 OpenCL C version: OpenCL C 1.2
   1.4 Parallel compute units: 5

I would be more than happy to help find a solution.

davide-maestroni commented 6 years ago

I've also tried with the hello_world example above, and the output is just:

Error: [ComputeCpp:RT0407] Failed to create OpenCL command queue
Segmentation fault (core dumped)

After some tests I found out that the clCreateCommandQueue call fails with error code CL_OUT_OF_HOST_MEMORY. Any clue?

DuncanMcBain commented 6 years ago

Hi @davide-maestroni, thanks for adding to this bug report. Now that we have confirmation of two users being affected by this issue, it's possible it's an AMD problem, particularly since the driver download page indicates that your particular hardware should be supported by this driver. I'm afraid that beyond that, I don't understand what's going on! That particular error is very unusual: it somehow indicates that the OpenCL implementation (i.e. AMD's implementation) does not have enough memory to create even a basic queue. Given that this happens on the smallest tests we have, I'm stumped! The code you linked does not create a queue, which is presumably why it does not fail.

As the documentation on this page suggests, a general failure to allocate resources on the host is what causes CL_OUT_OF_HOST_MEMORY. Quick Googling returns some information, if not fixes: https://developer.blender.org/T50761 https://github.com/fireice-uk/xmr-stak-amd/issues/69 https://github.com/fireice-uk/xmr-stak-amd/issues/53 https://stackoverflow.com/questions/39864947/opencl-cl-out-of-host-memory-on-clcreatecommandqueuewithproperties-with-minima

davide-maestroni commented 6 years ago

Thanks @DuncanMcBain, I actually had the same suspicion, namely that the issue is related to the AMD drivers. I was just hoping you had a better understanding of the problem. I also filed an issue with the Codeplay guys. Let's see what they have to say.

DuncanMcBain commented 6 years ago

I am a Codeplay guy! I'm sorry, I should have answered there as well, but thought it better to leave it on a public forum. In short: I'm the one who answers those issues too, and I still have no idea :smile:

While AMD are pushing their ROCm and hipify technologies, I think it would be beneficial if they still supported OpenCL on a variety of hardware. Unfortunately, the only Mxxx AMD hardware we have in the company is an older chip, too old to use the new drivers (I tried; it doesn't boot if you use them). Would you be able to pass this program on to AMD? I might be able to attach a plain OpenCL program that creates a queue in the same way if they want an OpenCL repro case (I'd be surprised if this were an artifact of the way ComputeCpp does things). You could try this: https://github.com/HandsOnOpenCL/Exercises-Solutions. It should work "out of the box" and also fail in the same way!

mirh commented 6 years ago

@davide-maestroni (and @GauthamPughaz) if you read the second of those links you'll see there's an alleged solution. TL;DR: AFAIK you just have AMD (for some reason) setting very conservative default memory caps.

I suppose ComputeCpp should have no problems with the GPU_FORCE_64BIT_PTR or GPU_USE_SYNC_OBJECTS vars being exported.

EDIT: https://bugs.freedesktop.org/show_bug.cgi?id=102491#c5
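[Editor's note] A session-only way to try this suggestion before making anything permanent might look like the following sketch (variable values taken from the workaround later in this thread; the caps only affect AMD's OpenCL runtime):

```shell
# Raise AMD's conservative default memory caps for the current shell only,
# then confirm the variables are visible before re-running the failing test.
export GPU_FORCE_64BIT_PTR=1
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
env | grep '^GPU_'
```

If this fixes the queue-creation failure, the exports can then be moved into ~/.bashrc to make them permanent.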

lukeiwanski commented 6 years ago

@GauthamPughaz is that still an issue? or can it be closed?

gauthampughazhendhi commented 6 years ago


Hi @lukeiwanski,

It's still an issue, and I haven't found any solutions so far when searching for it.

mirh commented 6 years ago

.... Did you read/try what I posted?

RafalKonklewski commented 6 years ago

@mirh The solution from your post really works, thanks a lot!

For people who didn't read it: add this to your ~/.bashrc file:

export GPU_FORCE_64BIT_PTR=1
export GPU_USE_SYNC_OBJECTS=1
export GPU_MAX_ALLOC_PERCENT=100
export GPU_SINGLE_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100