Compiling blas3.cu with clang++

singam-sanjay commented 6 years ago

I'd like to compile blas3.cu with clang++ (yes, clang++ can compile CUDA) instead of nvcc, to compare the performance of the prod kernels each compiler produces. I built clang and LLVM from source on the release50 branch of each repository and tried building the program with:

clang++ -DVIENNACL_WITH_CUDA -I/home/seabed/Software/viennacl-dev -I/usr/local/cuda/include ../examples/tutorial/blas3.cu -o examples/tutorial/blas3-clang-cuda -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -lpthread -lboost_chrono -lboost_date_time -lboost_serialization -lboost_system -lboost_thread -lboost_atomic -lpthread -O3 -Xcuda-ptxas "-O3 -m64 -fmad true"

The command failed because __shfl_xor is reported as an undeclared identifier:

In file included from ../examples/tutorial/blas3.cu:56:
In file included from viennacl-dev/viennacl/matrix.hpp:29:
In file included from viennacl-dev/viennacl/linalg/sparse_matrix_operations.hpp:37:
In file included from viennacl-dev/viennacl/linalg/cuda/sparse_matrix_operations.hpp:35:
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:146:32: error: use of undeclared identifier '__shfl_xor'
    min_index = min(min_index, __shfl_xor((int)min_index, (int)i));
                               ^
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:235:21: error: use of undeclared identifier '__shfl_xor'
    output_value += __shfl_xor((int)output_value, (int)i);

The error persists even after including the header files that declare and define the intrinsic:

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -30,10 +30,14 @@
+#include <sm_30_intrinsics.h>
+#include <sm_30_intrinsics.hpp>

Please suggest corrections for this strategy.

karlrupp commented 6 years ago

The way to fix this with nvcc is to specify the correct arch (e.g. -arch=sm_50). You probably need to do the same for clang.
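
For clang, the corresponding option is --cuda-gpu-arch (for example --cuda-gpu-arch=sm_50). Warp shuffle intrinsics such as __shfl_xor require compute capability 3.0 or higher, and the CUDA headers appear to declare them only when the device pass targets sm_30 or newer, which would explain why including the headers by hand made no difference. A minimal sketch of a guard that adds a readable diagnostic in that situation could look like the following. The helper name supports_warp_shuffle is hypothetical and not part of ViennaCL or CUDA; __CUDA_ARCH__ is the macro the CUDA compilers define during device compilation.

```cpp
// Sketch only: supports_warp_shuffle is a hypothetical helper,
// not a ViennaCL or CUDA API.
//
// __CUDA_ARCH__ encodes the target compute capability as
// major * 100 + minor * 10 (sm_30 -> 300, sm_50 -> 500) and is
// defined only while device code is being compiled.

// Warp shuffle (__shfl, __shfl_xor, ...) needs compute capability 3.0 or newer.
inline bool supports_warp_shuffle(int cuda_arch) {
    return cuda_arch >= 300;
}

// Emit a clear message when the device pass targets an older architecture.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 300)
#error "__shfl_xor needs sm_30 or newer: pass -arch=sm_XX (nvcc) or --cuda-gpu-arch=sm_XX (clang)"
#endif
```

Placed ahead of the ViennaCL includes in the .cu file, the preprocessor check is inert in the host pass (where __CUDA_ARCH__ is undefined) and only triggers when the device pass targets an architecture below sm_30.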

Btw: The fast BLAS3 kernels in ViennaCL are in the OpenCL backend. I haven't backported them to the CUDA backend yet.

singam-sanjay commented 6 years ago

That worked! Thanks!

singam-sanjay commented 6 years ago

But aren't these kernels part of the CUDA backend?

karlrupp commented 6 years ago

Yes, they are. But these are not as fast as the kernels generated by the OpenCL backend.

singam-sanjay commented 6 years ago

I modified blas3.cpp to default to OPENCL_MEMORY when VIENNACL_WITH_OPENCL is defined, and added a new "blas3-ocl" target to build the example for OpenCL:

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -140,7 +140,10 @@ ScalarType scaleToNbitIntIfInt(int n_bits)
 */
 int main()
 {
+#ifdef VIENNACL_WITH_OPENCL
+       viennacl::backend::default_memory_type(viennacl::OPENCL_MEMORY);
+#endif

--- a/examples/tutorial/CMakeLists.txt
+++ b/examples/tutorial/CMakeLists.txt
@@ -104,6 +104,12 @@ if (ENABLE_CUDA)

 endif (ENABLE_CUDA)

+if (ENABLE_UBLAS AND ENABLE_OPENCL)
+  include_directories(${Boost_INCLUDE_DIRS})
+  add_executable(blas3-ocl blas3.cpp)
+  set_target_properties(blas3-ocl PROPERTIES COMPILE_FLAGS "-g -DVIENNACL_WITH_OPENCL")
+  target_link_libraries(blas3-ocl ${Boost_LIBRARIES} ${OPENCL_LIBRARIES})
+endif ()

For matrices of size 128x16384 (A) and 16384x128 (B), the OpenCL code runs roughly 6x slower than the CUDA code:

OpenCL: 0.036686 s
CUDA: 0.00591053 s
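
To put the two timings on a common scale, they can be converted to achieved throughput. The sketch below assumes that each timing covers a single dense product C = A * B and uses the conventional count of 2*m*n*k floating-point operations; the helper names are illustrative and not part of ViennaCL.

```cpp
#include <cstdint>

// Sketch only: these helpers are illustrative and not part of ViennaCL.

// Floating-point operations in a dense product C = A * B, with A of size
// m x k and B of size k x n (one multiply and one add per inner-product term).
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in GFLOP/s for a run that took `seconds`.
inline double gflops(double flops, double seconds) {
    return flops / seconds / 1e9;
}

// How many times slower `slow_seconds` is than `fast_seconds`.
inline double slowdown(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}
```

With m = n = 128 and k = 16384, the reported times correspond to roughly 91 GFLOP/s for CUDA and roughly 15 GFLOP/s for OpenCL, a slowdown of about 6.2x.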

System setup:
Ubuntu 16.04 x86_64
Quadro K1200
CUDA SDK, drivers and OpenCL packages: nvidia-opencl-dev, nvidia-375, nvidia-opencl-icd-375, cuda-8-0

Can the slowdown be attributed to the NVIDIA GPU?

viennacl / viennacl-dev

Compiling blas3.cu with clang++ #245