Open singam-sanjay opened 6 years ago
The way to fix this with nvcc is to specify the correct arch (e.g. -arch=sm_50). You probably need to do the same for clang (clang spells the flag --cuda-gpu-arch=sm_50).
Btw: The fast BLAS3 kernels in ViennaCL are in the OpenCL backend. I haven't backported them to the CUDA backend yet.
That worked!! Thanks!
But aren't these kernels part of the CUDA backend?
Yes, they are. But these are not as fast as the kernels generated by the OpenCL backend.
I modified blas3.cpp to default to OPENCL_MEMORY when VIENNACL_WITH_OPENCL is defined, and added a new "blas3-ocl" target that compiles the example for OpenCL:
--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -140,7 +140,10 @@ ScalarType scaleToNbitIntIfInt(int n_bits)
*/
int main()
{
+#ifdef VIENNACL_WITH_OPENCL
+ viennacl::backend::default_memory_type(viennacl::OPENCL_MEMORY);
+#endif
--- a/examples/tutorial/CMakeLists.txt
+++ b/examples/tutorial/CMakeLists.txt
@@ -104,6 +104,12 @@ if (ENABLE_CUDA)
endif (ENABLE_CUDA)
+if (ENABLE_UBLAS AND ENABLE_OPENCL)
+ include_directories(${Boost_INCLUDE_DIRS})
+ add_executable(blas3-ocl blas3.cpp)
+ set_target_properties(blas3-ocl PROPERTIES COMPILE_FLAGS "-g -DVIENNACL_WITH_OPENCL")
+ target_link_libraries(blas3-ocl ${Boost_LIBRARIES} ${OPENCL_LIBRARIES})
+endif ()
For matrices of size 128x16384 (A) and 16384x128 (B), the OpenCL code runs about 6x slower than the CUDA code:
OpenCL: 0.036686 s
CUDA: 0.00591053 s
System setup:
OS: Ubuntu 16.04 x86_64
GPU: Quadro K1200
CUDA SDK, drivers and OpenCL packages: nvidia-opencl-dev, nvidia-375, nvidia-opencl-icd-375, cuda-8-0
Can the slowdown be attributed to the NVIDIA GPU?
I'd like to compile blas3.cu with clang++ (yes, clang++ can compile CUDA) instead of nvcc, to compare the performance of the prod kernels each compiler produces. I've built clang and llvm from source on the release50 branch of each repository and tried building the program with:
clang++ -DVIENNACL_WITH_CUDA -I/home/seabed/Software/viennacl-dev -I/usr/local/cuda/include ../examples/tutorial/blas3.cu -o examples/tutorial/blas3-clang-cuda -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -lpthread -lboost_chrono -lboost_date_time -lboost_serialization -lboost_system -lboost_thread -lboost_atomic -lpthread -O3 -Xcuda-ptxas "-O3 -m64 -fmad true"
The command failed, with __shfl_xor reported as an undefined intrinsic.
The error persists even after including the header files that declare and define the intrinsic.
Please suggest corrections for this strategy.