Ubuntu + AMDGPU-PRO results?

skn123 commented 7 years ago

I don't think I am able to understand these results. Do they make sense?

The output as provided by clInfo:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (2348.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon(TM) HD 8800 Series
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 10
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 850Mhz
Address bits: 64
Max memory allocation: 1152442368
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
  Denorms: No
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 1703313408
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
  Coarse grain buffer: No
  Fine grain buffer: No
  Fine grain system: No
  Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: No
Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
Queue on Device properties:
  Out-of-Order: No
  Profiling : No
Platform ID: 0x7ff9b6c83e98
Name: Pitcairn
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2348.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 2400Mhz
Address bits: 64
Max memory allocation: 3133924352
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
  Denorms: Yes
  Quiet NaNs: Yes
  Round to nearest even: Yes
  Round to zero: Yes
  Round to +ve and infinity: Yes
  IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 12535697408
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 3133924352
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
  Coarse grain buffer: No
  Fine grain buffer: No
  Fine grain system: No
  Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
  Execute OpenCL kernels: Yes
  Execute native function: Yes
Queue on Host properties:
  Out-of-Order: No
  Profiling : Yes
Queue on Device properties:
  Out-of-Order: No
  Profiling : No
Platform ID: 0x7ff9b6c83e98
Name: AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
Vendor: AuthenticAMD
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3 (sse2,avx,fma4)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2348.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_spir cl_khr_gl_event

The output as provided by nmf:

_Input matrices: V7,6 W7,3 H3,6

Computing NMF RESULT: V7,6 W7,3 H3,6

W*H:

7,6_

------- Tutorial completed --------

This is a benchmark:

./dense_blas-bench-opencl
Device Info

Name: Pitcairn
Vendor: Advanced Micro Devices, Inc.
Type: GPU
Available: 1
Max Compute Units: 10
Max Work Group Size: 256
Global Mem Size: 1703444480
Local Mem Size: 32768
Local Mem Type: 1
Host Unified Memory: 0

Benchmark : BLAS

sCOPY : 40.4 GB/s
sAXPY : 31.2 GB/s
sDOT : 33.1 GB/s
sGEMV-N : 17.3 GB/s
sGEMV-T : 31.4 GB/s
sGEMM-NN : 245 GFLOPs/s
sGEMM-NT : 237 GFLOPs/s
sGEMM-TN : 254 GFLOPs/s
sGEMM-TT : 290 GFLOPs/s

dCOPY : 46.2 GB/s
dAXPY : 42.3 GB/s
dDOT : 35.8 GB/s
dGEMV-N : 29.6 GB/s
dGEMV-T : 34.8 GB/s
dGEMM-NN : 105 GFLOPs/s
dGEMM-NT : 106 GFLOPs/s
dGEMM-TN : 105 GFLOPs/s
dGEMM-TT : 94.9 GFLOPs/s

adminspin@adminspin-System-Product-Name:~/build/viennacl/examples/benchmarks$ ./dense_blas-bench-cpu
Benchmark : BLAS

sCOPY : 5.02 GB/s
sAXPY : 7.14 GB/s
sDOT : 1.75 GB/s
sGEMV-N : 4.47 GB/s
sGEMV-T : 2.91 GB/s
sGEMM-NN : 1.25 GFLOPs/s
sGEMM-NT : 1.21 GFLOPs/s
sGEMM-TN : 1.27 GFLOPs/s
sGEMM-TT : 1.26 GFLOPs/s

dCOPY : 4.99 GB/s
dAXPY : 7.13 GB/s
dDOT : 3.52 GB/s
dGEMV-N : 4.51 GB/s
dGEMV-T : 4.19 GB/s
dGEMM-NN : 1.22 GFLOPs/s
dGEMM-NT : 1.17 GFLOPs/s
dGEMM-TN : 1.24 GFLOPs/s
dGEMM-TT : 1.2 GFLOPs/s
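For anyone trying to read the benchmark figures above, here is a rough sketch of how GB/s and GFLOPs/s numbers of this kind are commonly derived. It assumes the usual conventions (2*M*N*K floating-point operations for a GEMM, one read plus one write per element for a copy). Whether the ViennaCL benchmark counts operations and bytes exactly this way is an assumption, and all function names below are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations in C = A * B, where A is M x K and B is K x N.
// Convention assumed here: one multiply and one add per inner-loop step.
double gemm_flops(std::size_t M, std::size_t N, std::size_t K) {
    return 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
}

// Bytes moved by a vector copy y = x with n elements.
// Convention assumed here: each element is read once and written once.
double copy_bytes(std::size_t n, std::size_t bytes_per_element) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(bytes_per_element);
}

// Convert an operation count and a wall-clock time into GFLOPs/s.
double to_gflops_per_second(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

// Convert a byte count and a wall-clock time into GB/s (1 GB = 1e9 bytes).
double to_gb_per_second(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

// Ratio between two throughput figures, e.g. a GPU result over a CPU result.
double speedup(double fast, double slow) {
    assert(slow > 0.0);
    return fast / slow;
}
```

With the numbers reported above, `speedup(245.0, 1.25)` gives a factor of 196 between the OpenCL and CPU sGEMM-NN results.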

karlrupp commented 7 years ago

Could you please restate your question? I'm afraid I don't know what you are asking here.

skn123 commented 7 years ago

The numbers didn't come out correctly. After running NMF, I get the following:

W*H [7,6] ((0,0,0,0,0,0),(0,0,0,0,0,0),(0,0,0,0,0,0),(0,0,0,0,0,0),(0,0,0,0,0,0),(0,0,0,0,0,0),(0,0,0,0,0,0))

Is this correct? The host and device tests look plausible.
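One way to judge a result like this independently of the library is to recompute W*H on the host and compare it against V. The sketch below is not ViennaCL code: `HostMatrix` and the helper functions are hypothetical names using only the standard library, and the factors would first have to be copied back from the device. A factorization that worked should give a small relative residual, whereas an all-zero product for a non-zero V gives a residual of exactly 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix on the host (hypothetical helper, not a ViennaCL type).
struct HostMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;  // rows * cols entries, row-major

    HostMatrix(std::size_t r, std::size_t c, double fill = 0.0)
        : rows(r), cols(c), data(r * c, fill) {}

    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Naive reference product W * H.
HostMatrix multiply(const HostMatrix& W, const HostMatrix& H) {
    assert(W.cols == H.rows);
    HostMatrix P(W.rows, H.cols);
    for (std::size_t i = 0; i < W.rows; ++i)
        for (std::size_t k = 0; k < W.cols; ++k)
            for (std::size_t j = 0; j < H.cols; ++j)
                P(i, j) += W(i, k) * H(k, j);
    return P;
}

// Frobenius norm: square root of the sum of squared entries.
double frobenius_norm(const HostMatrix& A) {
    double sum = 0.0;
    for (double x : A.data) sum += x * x;
    return std::sqrt(sum);
}

// Relative residual ||V - W*H||_F / ||V||_F (absolute residual if V is zero).
double relative_residual(const HostMatrix& V, const HostMatrix& W, const HostMatrix& H) {
    const HostMatrix P = multiply(W, H);
    assert(P.rows == V.rows && P.cols == V.cols);
    HostMatrix D(V.rows, V.cols);
    for (std::size_t idx = 0; idx < D.data.size(); ++idx)
        D.data[idx] = V.data[idx] - P.data[idx];
    const double v_norm = frobenius_norm(V);
    return v_norm > 0.0 ? frobenius_norm(D) / v_norm : frobenius_norm(D);
}

// True if every entry is exactly zero, as in the W*H printed above.
bool is_all_zero(const HostMatrix& A) {
    for (double x : A.data)
        if (x != 0.0) return false;
    return true;
}
```

For the 7x6, 7x3 and 3x6 shapes from the tutorial, calling `relative_residual(V, W, H)` on the copied-back factors and `is_all_zero(multiply(W, H))` would distinguish a poor fit from a product that was never computed.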

karl, please reformat my question. I am unable to reformat it myself :(

karlrupp commented 7 years ago

Oh, this doesn't sound reasonable. If possible, can you run tests/nmf-test-opencl? It requires ENABLE_TESTING to be activated in CMake and also uBLAS to be available on the system.
