lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Query: QUDA Feature-SYCL branch #1332

Open Soujanyajanga opened 1 year ago

Soujanyajanga commented 1 year ago

In the QUDA feature/sycl branch, is the SYCL backend fully functional? Does it work on NVIDIA as well, or is it intended only for Intel architectures? Please share the steps to exercise the tests on Intel/NVIDIA platforms.

jcosborn commented 1 year ago

It is essentially fully functional. Depending on which version of oneAPI and hardware you run with there may be some issues though. It requires Intel SYCL since it uses some Intel extensions. I've only tried it on Intel hardware, but it might run with the CUDA backend for Intel LLVM as well. Note that there are some changes to follow the SYCL 2020 spec in the upstream Intel LLVM repo which I haven't updated the code for yet. It should work with the current public oneAPI release though. Example build and test commands (which will need updating soon) are below.

```
export QUDA_TARGET=SYCL
export CXX=dpcpp
o="$o -DCMAKE_INSTALL_PREFIX="
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_BUILD_SHAREDLIB=OFF"
cmake $o
make
make test
```
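For a current oneAPI toolchain the same flow might look like the sketch below; `icpx -fsycl` replaces the deprecated `dpcpp` driver (as the icpx warning later in this thread notes). The install prefix and build directory names here are hypothetical placeholders, and passing `-fsycl` via `CXXFLAGS` is an assumption about how your CMake setup picks up flags:

```
# Sketch only: assumes the oneAPI environment is already sourced and the
# QUDA checkout is the current directory.
export QUDA_TARGET=SYCL
export CXX=icpx
export CXXFLAGS="-fsycl"                   # dpcpp is deprecated in favor of icpx -fsycl
o="-DCMAKE_INSTALL_PREFIX=$PWD/install"    # hypothetical install prefix
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_BUILD_SHAREDLIB=OFF"
cmake -S . -B build $o
cmake --build build -j
ctest --test-dir build
```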

Soujanyajanga commented 1 year ago


Was this SYCL backend tested with the Clang compiler?

jcosborn commented 1 year ago

I've only tested it with dpcpp/icpx.


Soujanyajanga commented 1 year ago

@jcosborn With the latest Intel LLVM compiler, the following error is observed with the latest code:

```
[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/inv_ca_gcr.cpp.o
[ 12%] Building CXX object lib/CMakeFiles/quda_cpp.dir/dirac.cpp.o
clang-16: error: unknown argument: '-fhonor-nan-compares'
clang-16: error: unknown argument: '-fhonor-nan-compares'
```

This error comes from the file "quda/lib/targets/sycl/target_sycl.cmake":

```
if("x${CMAKE_CXX_COMPILER_ID}" STREQUAL "xClang" OR
   "x${CMAKE_CXX_COMPILER_ID}" STREQUAL "xIntelLLVM")
  #target_compile_options(quda INTERFACE -fhonor-nan-compares)
  #target_compile_options(quda PRIVATE -fhonor-nan-compares)
  target_compile_options(quda PUBLIC -fhonor-nan-compares)    # <<<<<<<< Clang does not support this flag
  target_compile_options(quda PUBLIC -Wno-tautological-constant-compare)
```
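A more defensive pattern, sketched here with CMake's standard `CheckCXXCompilerFlag` module (the cache variable name is mine), would probe for the flag before adding it, so compilers that reject `-fhonor-nan-compares` skip it automatically:

```cmake
include(CheckCXXCompilerFlag)
# Only add the flag if the active compiler actually accepts it
check_cxx_compiler_flag("-fhonor-nan-compares" QUDA_CXX_HAS_HONOR_NAN_COMPARES)
if(QUDA_CXX_HAS_HONOR_NAN_COMPARES)
  target_compile_options(quda PUBLIC -fhonor-nan-compares)
endif()
```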

jcosborn commented 1 year ago

Thanks for reporting that. This is fixed now. I have successfully tested it on Intel, but had issues on NVIDIA.

maddyscientist commented 1 year ago

@jcosborn what are the issues on NVIDIA?

jcosborn commented 1 year ago

I get a bunch of errors like:

```
ptxas error : Entry function '_ZTSZZN4quda6launchINS_9Kernel3DSINS_14dslash_functorENS_18dslash_functor_argINS_19domainWall4DFusedM5ENS_9packShmemELi2ELb0ELb1ELNS_10KernelTypeE5ENS_22DomainWall4DFusedM5ArgIsLi3ELi4EL21QudaReconstructType_s8ELNS_11Dslash5TypeE8EEEEELb0EEESB_EENSt9enable_ifIXntclsr6deviceE14use_kernel_argIT0_EEE11qudaError_tE4typeERKNS_12qudaStream_tERN4sycl3_V18nd_rangeILi3EEERKSE_ENKUlRNSM_7handlerEE_clEST_EUlNSM_7nd_itemILi3EEEE__with_offset' uses too much shared data (0x18000 bytes, 0xc000 max)
```

maddyscientist commented 1 year ago

Ok, it looks like you (or the SYCL backend) are using static shared memory as opposed to dynamic shared memory: the former has a limit of 48 KiB per thread block, while the latter has a much larger limit (96 KiB on Volta, ~164 KiB on Ampere, ~228 KiB on Hopper). Is this something one has control over with SYCL on NVIDIA, or is it out of your hands?

jcosborn commented 1 year ago

I wasn't setting the compute capability before, I'm trying again with sm_80. I'm not sure what else I can change yet.

jxy commented 1 year ago

I thought this line controls the size, no? https://github.com/lattice/quda/blob/aa2ea419ce0f6f78f842f85f40cb2a607944c957/include/targets/sycl/target_device.h#L196

maddyscientist commented 1 year ago

@jcosborn the compute capability shouldn't matter here, as the static limit is 48 KiB for all CUDA GPUs since Fermi (2010). The fact that the compiler throws this error indicates that static shared memory is being used as opposed to dynamic, and that is the first red flag here. For dynamic shared memory, the compiler doesn't know what the shared memory per block is, so it can't throw an error like this.

At least with the CUDA target, with static shared memory, it doesn't surprise me that an excess amount would be produced, as the SharedMemoryCacheHelper with a static allocation will request as much shared memory as is required for the maximum block size (1024 threads).

jcosborn commented 1 year ago

Yes, it seems it will only use static shared memory: https://github.com/intel/llvm/pull/3329

I'll see what I can get to compile now, and look into setting a limit for it.

sy3394 commented 1 year ago

I also have several issues compiling this branch of QUDA, as well as some questions.

Questions:

  1. Do you assume the user compiles this software using dpcpp, in particular the one from oneAPI 2022.1.0? I ask because some files include sycl/ext/oneapi/experimental/builtins.hpp, which can be found in the 2022 version of the oneAPI distribution but not in version 2021.2.0.
  2. What command-line options did you use when installing oneAPI? I ask because lib/targets/sycl/blas_lapack_mkl.cpp includes the file oneapi/mkl.hpp when QUDA_NATIVE_LAPACK is set to true (the default). I assume this is part of oneAPI, as the path contains oneapi; however, I was not able to locate this file in my oneAPI distribution.
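For what it's worth, `oneapi/mkl.hpp` ships with the oneMKL component of the oneAPI Base Toolkit rather than with the compiler alone; after sourcing the environment script it should be findable under `$MKLROOT` (the default install path is shown below, yours may differ):

```
source /opt/intel/oneapi/setvars.sh     # sets MKLROOT, among other variables
ls "$MKLROOT/include/oneapi/mkl.hpp"    # header included by blas_lapack_mkl.cpp
```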

There are some error messages when I try to compile this branch of QUDA.

  1. lib/targets/sycl/device.cpp:105:91: error: 'max_work_item_sizes' does not name a template but is followed by template arguments. max_work_item_sizes is set in include/sycl/CL/sycl/info/info_desc.hpp from oneAPI to be max_work_item_sizes = CL_DEVICE_MAX_WORK_ITEM_SIZES. In turn, CL_DEVICE_MAX_WORK_ITEM_SIZES is set in include/sycl/CL/cl.h using #define. I'm not sure why I get this error. Is it due to an incorrect installation of oneAPI, or a missing command-line argument to cmake when compiling QUDA?
  2. There are other errors like the one above, such as lib/targets/sycl/device.cpp:81:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'. These seem to suggest that I am using a SYCL version or implementation different from the one this branch of QUDA assumes.
The list of similar errors (repeated candidate-constructor notes elided):
```
/cyclamen/home/syamamoto/src/quda_sycl/lib/targets/sycl/device.cpp:81:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
  auto p = sycl::platform(mySelector);
/onyx/buildsets/eb_cyclamen/software/intel-compilers/2022.1.0/compiler/2022.1.0/linux/include/sycl/CL/sycl/platform.hpp:56:12: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::device_selector' for 1st argument
  explicit platform(const device_selector &DeviceSelector);
.../sycl/CL/sycl/platform.hpp:58:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'const sycl::platform' for 1st argument
  platform(const platform &rhs) = default;
.../sycl/CL/sycl/platform.hpp:60:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'sycl::platform' for 1st argument
  platform(platform &&rhs) = default;
.../sycl/CL/sycl/platform.hpp:148:3: note: candidate constructor not viable: no known conversion from 'int (*)(const sycl::device &)' to 'std::shared_ptr' for 1st argument
  platform(std::shared_ptr impl) : impl(impl) {}
.../sycl/CL/sycl/platform.hpp:37:3: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
  platform();
.../lib/targets/sycl/device.cpp:105:110: error: invalid operands to binary expression ('sycl::info::device' and 'int')
  printfQuda(" Max work item sizes: %s\n", str(myDevice.get_info>()).c_str());
.../lib/../include/util_quda.h:91:29: note: expanded from macro 'printfQuda'
  sprintf(getPrintBuffer(), __VA_ARGS__); \
.../lib/targets/sycl/device.cpp:105:91: error: 'max_work_item_sizes' does not name a template but is followed by template arguments
.../sycl/CL/sycl/info/info_desc.hpp:55:3: note: non-template declaration found by name lookup
  max_work_item_sizes = CL_DEVICE_MAX_WORK_ITEM_SIZES,
.../lib/targets/sycl/device.cpp:146:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
  auto p = sycl::platform(mySelector);
  [same candidate-constructor notes as above]
.../lib/targets/sycl/device.cpp:154:16: error: no matching conversion for functional-style cast from 'int (*)(const sycl::device &)' to 'sycl::platform'
  auto p = sycl::platform(mySelector);
  [same candidate-constructor notes as above]
.../lib/targets/sycl/device.cpp:161:17: error: no namespace named 'device' in namespace 'sycl::info'; did you mean simply 'device'?
  namespace id = sycl::info::device;
.../lib/targets/sycl/device.cpp:59:13: note: namespace 'device' defined here
.../lib/targets/sycl/device.cpp:164:81: error: no member named 'name' in namespace 'quda::device'
  printfQuda("%d - name: %s\n", device, d.get_info().c_str());
.../lib/targets/sycl/device.cpp:364:75: error: invalid operands to binary expression ('sycl::info::device' and 'int')
  auto val = myDevice.get_info>();
.../lib/targets/sycl/device.cpp:364:56: error: 'max_work_item [log truncated]
```
jcosborn commented 1 year ago

Yes, it generally requires the latest version of oneAPI (or intel-llvm). I'm currently testing with 2023.0.0. The issues you are seeing are due to differences in the older version of oneAPI.

sy3394 commented 1 year ago

Thank you for your prompt reply. I will install the new version and try it out.

Meanwhile, I have another simple question. I am trying to compile QUDA targeting SYCL because I want to use QUDA in an environment possibly without GPUs, for testing purposes. Performance is not my main concern; I just need to run QUDA without GPUs. I assume this branch of QUDA works on CPUs. Am I correct?

jcosborn commented 1 year ago

Yes, it works with the opencl:cpu backend, though performance isn't very good.
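Run-time device selection can be steered with the oneAPI device-selector environment variables; the following is a sketch (the variable names come from oneAPI releases, and the test binary name is just an example):

```
# Newer oneAPI runtimes: restrict SYCL to the OpenCL CPU device
export ONEAPI_DEVICE_SELECTOR=opencl:cpu
# Older runtimes used SYCL_DEVICE_FILTER instead:
# export SYCL_DEVICE_FILTER=opencl:cpu
./tests/dslash_test    # any QUDA test binary
```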

li12242 commented 3 months ago

@jcosborn I have tried the following build commands, but encountered some errors when linking.


```
export QUDA_TARGET=SYCL
export CXX=dpcpp
o="$o -DCMAKE_INSTALL_PREFIX="
o="$o -DQUDA_FAST_COMPILE_REDUCE=ON"
o="$o -DQUDA_BUILD_SHAREDLIB=OFF"
cmake $o
make
make test
```

The following errors occur at linking. The compiler I used is icpx-2023.2.4, together with oneMKL 2023.0.0.

[ 82%] Linking CXX executable gauge_alg_test
icpx: warning: use of 'dpcpp' is deprecated and will be removed in a future release. Use 'icpx -fsycl' [-Wdeprecated]
/usr/bin/ld: ../lib/libquda.a(timer.cpp.o): unable to initialize decompress status for section .debug_loc
/usr/bin/ld: ../lib/libquda.a(timer.cpp.o): unable to initialize decompress status for section .debug_loc
../lib/libquda.a: error adding symbols: File format not recognized
icpx: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [tests/CMakeFiles/gauge_alg_test.dir/build.make:99: tests/gauge_alg_test] Error 1
make[1]: *** [CMakeFiles/Makefile2:1088: tests/CMakeFiles/gauge_alg_test.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
li12242 commented 3 months ago


Sorry for the mistakes. I updated the binutils tools and the errors disappeared.