idefix-code / idefix

A fast finite volume code designed to run on many architectures, such as GPU, CPU and manycores, using Kokkos.
https://idefix.readthedocs.io/

Error when trying to build `Idefix` with GPU support #9

Closed: dutta-alankar closed this issue 1 year ago

dutta-alankar commented 1 year ago

I'm trying to compile Idefix for the Sod problem with the following cmake command:

cmake $IDEFIX_DIR   -DKokkos_ENABLE_CUDA=ON   -DKokkos_ARCH_AMPERE86=ON   -DCMAKE_CXX_COMPILER=g++

and this leads me to the following error:

nvcc fatal   : Value 'sm_35' is not defined for option 'gpu-architecture'
CMake Error at src/kokkos/cmake/kokkos_compiler_id.cmake:12 (STRING):
  STRING sub-command REPLACE requires at least four arguments.
Call Stack (most recent call first):
  src/kokkos/cmake/kokkos_compiler_id.cmake:45 (kokkos_internal_have_compiler_nvcc)
  src/kokkos/cmake/kokkos_tribits.cmake:204 (INCLUDE)
  src/kokkos/CMakeLists.txt:170 (KOKKOS_SETUP_BUILD_ENVIRONMENT)

-- SERIAL backend is being turned on to ensure there is at least one Host space. To change this, you must enable another host execution space and configure with -DKokkos_ENABLE_SERIAL=OFF or change CMakeCache.txt
-- Using -std=c++17 for C++17 standard as feature
CMake Error at src/kokkos/cmake/kokkos_test_cxx_std.cmake:132 (MESSAGE):
  Invalid compiler for CUDA.  The compiler must be nvcc_wrapper or Clang or
  use kokkos_launch_compiler, but compiler ID was GNU
Call Stack (most recent call first):
  src/kokkos/cmake/kokkos_tribits.cmake:231 (INCLUDE)
  src/kokkos/CMakeLists.txt:170 (KOKKOS_SETUP_BUILD_ENVIRONMENT)

I got around this by changing line 15 of `src/kokkos/bin/nvcc_wrapper` from `default_arch="sm_35"` to `default_arch="compute_86"`. After this, cmake successfully generates the Makefile, but the subsequent make command fails with the following error. Can you suggest a fix? I'm new to both Idefix and Kokkos.

(base) alankar@juggernaut:sod$ make
[  1%] Building CXX object build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
[  2%] Building CXX object build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o
/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(936): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(941): error: identifier "__half2double" is undefined

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(946): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(952): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(957): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(962): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(967): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(973): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(76): warning #821-D: extern inline function "Kokkos::Experimental::cast_to_half(bool)" was referenced but not defined

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(101): warning #821-D: extern inline function "Kokkos::Experimental::cast_from_half<T>(Kokkos::Experimental::half_t) [with T=bool]" was referenced but not defined

8 errors detected in the compilation of "/Data/alankar/work/idefix/src/kokkos/core/src/impl/Kokkos_Core.cpp".
make[2]: *** [build/kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:90: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o] Error 2
make[1]: *** [CMakeFiles/Makefile2:1171: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
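The very first configure error in this issue is nvcc rejecting `sm_35`, the default architecture set in Kokkos' `nvcc_wrapper`. CUDA 12 dropped support for Kepler targets such as `sm_35`, which would be consistent with the same configure step passing under CUDA 11.6 later in this thread. A minimal sketch for checking this up front, assuming `nvcc --version` prints a line of the form `Cuda compilation tools, release X.Y, ...`; `cuda_release` and `supports_sm35` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: warn when the installed CUDA is new enough (>= 12, an assumption
# based on CUDA 12 dropping Kepler targets) that nvcc_wrapper's sm_35
# default would be rejected. Helper names are hypothetical.

# Extract "MAJOR.MINOR" from `nvcc --version` output read on stdin,
# e.g. "Cuda compilation tools, release 12.0, V12.0.76" -> "12.0".
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}

# Succeed (status 0) if this CUDA release still accepts sm_35.
supports_sm35() {
  local major=${1%%.*}
  [ "$major" -lt 12 ]
}

# Only query nvcc if it is actually installed.
if command -v nvcc >/dev/null 2>&1; then
  rel=$(nvcc --version | cuda_release)
  if [ -n "$rel" ]; then
    if supports_sm35 "$rel"; then
      echo "CUDA $rel: nvcc_wrapper's sm_35 default should be accepted"
    else
      echo "CUDA $rel: sm_35 is rejected; expect the 'sm_35' configure error"
    fi
  fi
fi
```

Editing `default_arch` in `nvcc_wrapper`, as done above, gets past the configure check but modifies a vendored file; later in this thread the CUDA 12 build was fixed by updating Kokkos instead.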
neutrinoceros commented 1 year ago

Hi @dutta-alankar, welcome to Idefix! Can you tell us which machine you are trying to compile on? It is possible that Kokkos isn't properly detecting your target architecture, in which case the solution would be to enable it explicitly at configuration time (using ccmake).

glesur commented 1 year ago

Hi @dutta-alankar, it's usually not a good idea to specify a compiler when using CUDA, since Kokkos internally assumes nvcc. I would therefore suggest removing the `-DCMAKE_CXX_COMPILER` option.

If you do want to use g++ as the host compiler, set the `CXX` environment variable to `g++` so that nvcc knows which host compiler to use.
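The advice above amounts to: drop `-DCMAKE_CXX_COMPILER` and name the host compiler through `CXX`. A minimal sketch of that configure step as a shell function, run from the problem directory as in this thread (the prompt earlier shows the Sod setup directory). `idefix_configure_cuda` and the `CMAKE` override are hypothetical, and the sketch assumes Idefix forwards `CXX` to nvcc as the host compiler, as described above. It also warns about an existing `CMakeCache.txt`, because CMake only reads `CXX` on the first configure of a build directory and caches the result.

```shell
#!/usr/bin/env bash
# Sketch: configure Idefix for CUDA without -DCMAKE_CXX_COMPILER, passing
# the host compiler via the CXX environment variable instead.
# idefix_configure_cuda and the CMAKE override are hypothetical helpers.

CMAKE=${CMAKE:-cmake}  # override (e.g. CMAKE=echo) for a dry run

# Usage: idefix_configure_cuda IDEFIX_DIR PROBLEM_DIR HOST_CXX [extra cmake args...]
# The body runs in a subshell, so the cd does not leak into the caller.
idefix_configure_cuda() (
  src=$1 problem=$2 host=$3
  shift 3
  for arg in "$@"; do
    case $arg in
      -DCMAKE_CXX_COMPILER=*)
        echo "error: drop $arg and pass the host compiler via CXX instead" >&2
        exit 1 ;;
    esac
  done
  # CMake reads CXX only when a build directory is first configured and then
  # caches the compiler, so a leftover CMakeCache.txt keeps the old choice.
  if [ -e "$problem/CMakeCache.txt" ]; then
    echo "warning: $problem/CMakeCache.txt exists; CXX=$host may be ignored" >&2
  fi
  cd "$problem" || exit 1
  CXX=$host "$CMAKE" "$src" -DKokkos_ENABLE_CUDA=ON "$@"
)
```

For the setup in this issue, that would be `idefix_configure_cuda "$IDEFIX_DIR" . g++ -DKokkos_ARCH_AMPERE86=ON`, followed by `make -j$(nproc)`.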

dutta-alankar commented 1 year ago

@glesur and @neutrinoceros Thanks for all the suggestions! I initially tried plain `cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON` and ran into similar problems, which is why I was trying to set a compiler. I also tried nvcc and `kokkos/bin/nvcc_wrapper`, with no luck. I was initially using CUDA 12.0, which gave the problem I described earlier. I have now switched to CUDA 11.6, and with plain `cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON` I get a different error during make:

/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
make[2]: *** [build/kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:76: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1171: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

My system:

CXX compiler: GCC 11.3.0

CUDA: 11.6 and 12.0 (tried both)

GPU: RTX 3090 (Ampere, `AMPERE86` in Kokkos)

Also, with CUDA 11.6, cmake succeeded without changing the `default_arch="sm_35"` line, unlike with CUDA 12.0.

dutta-alankar commented 1 year ago

I'm summarizing my findings here. In all cases I configure with `cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON` and, if that succeeds, build with `make -j$(nproc)`.

-- SERIAL backend is being turned on to ensure there is at least one Host space. To change this, you must enable another host execution space and configure with -DKokkos_ENABLE_SERIAL=OFF or change CMakeCache.txt
-- Using -std=c++17 for C++17 standard as feature
CMake Error at src/kokkos/cmake/kokkos_test_cxx_std.cmake:132 (MESSAGE):
  Invalid compiler for CUDA.  The compiler must be nvcc_wrapper or Clang or
  use kokkos_launch_compiler, but compiler ID was GNU
Call Stack (most recent call first):
  src/kokkos/cmake/kokkos_tribits.cmake:231 (INCLUDE)
  src/kokkos/CMakeLists.txt:170 (KOKKOS_SETUP_BUILD_ENVIRONMENT)

-- Configuring incomplete, errors occurred!

- I changed line 15 in `src/kokkos/bin/nvcc_wrapper` from `default_arch="sm_35"` to `default_arch="compute_86"`. After this, `cmake` successfully generates the Makefile, but the subsequent `make` command fails with the following error:

[  1%] Building CXX object build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
[  2%] Building CXX object build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o
/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(936): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(941): error: identifier "__half2double" is undefined

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(946): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(952): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(957): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(962): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(967): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(973): error: no suitable user-defined conversion from "Kokkos::Experimental::half_t" to "__half" exists

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(76): warning #821-D: extern inline function "Kokkos::Experimental::cast_to_half(bool)" was referenced but not defined

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/Data/alankar/work/idefix/src/kokkos/core/src/Cuda/Kokkos_Cuda_Half.hpp(101): warning #821-D: extern inline function "Kokkos::Experimental::cast_from_half<T>(Kokkos::Experimental::half_t) [with T=bool]" was referenced but not defined

8 errors detected in the compilation of "/Data/alankar/work/idefix/src/kokkos/core/src/impl/Kokkos_Core.cpp".
make[2]: *** [build/kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:90: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o] Error 2
make[1]: *** [CMakeFiles/Makefile2:1171: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

- Using `cuda 11.6`, `cmake` can create the `Makefile` with no modifications to `nvcc_wrapper`. However, `make` fails with a different error message:

[  1%] Building CXX object build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
make[2]: *** [build/kokkos/core/src/CMakeFiles/kokkoscore.dir/build.make:76: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1171: build/kokkos/core/src/CMakeFiles/kokkoscore.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

glesur commented 1 year ago

Thanks for your feedback, @dutta-alankar. For CUDA 12: this is a known incompatibility of the Kokkos version pinned by Idefix. You can try to force an update by going to `src/kokkos` and running `git pull origin master`. You should then get Kokkos 4.0, which is compatible with CUDA 12.
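To confirm that such an update actually brought in Kokkos 4.0, one option is to read the version back from the submodule's top-level `CMakeLists.txt`. A minimal sketch, assuming that file declares `set(Kokkos_VERSION_MAJOR N)`-style lines (older releases may spell `SET` in upper case); `kokkos_version` is a hypothetical helper, not part of Idefix or Kokkos:

```shell
#!/usr/bin/env bash
# Sketch: after updating the Kokkos submodule, e.g.
#   cd "$IDEFIX_DIR/src/kokkos" && git pull origin master
# read back the version now checked out. Assumes the top-level Kokkos
# CMakeLists.txt contains set(Kokkos_VERSION_MAJOR N) style lines (upper-case
# SET is also accepted). kokkos_version is a hypothetical helper.

# Usage: kokkos_version PATH/TO/kokkos/CMakeLists.txt   -> e.g. "4.0.0"
kokkos_version() {
  local file=$1 part value version=""
  for part in MAJOR MINOR PATCH; do
    value=$(sed -n "s/^ *[Ss][Ee][Tt](Kokkos_VERSION_${part} \([0-9][0-9]*\)).*/\1/p" "$file" | head -n 1)
    [ -n "$value" ] || return 1   # line not found: unexpected file layout
    version=${version:+$version.}$value
  done
  echo "$version"
}

# Only run the check when IDEFIX_DIR points at a checkout with Kokkos in it.
if [ -n "${IDEFIX_DIR:-}" ] && [ -f "$IDEFIX_DIR/src/kokkos/CMakeLists.txt" ]; then
  v=$(kokkos_version "$IDEFIX_DIR/src/kokkos/CMakeLists.txt") && echo "Kokkos $v"
fi
```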

For CUDA 11.6: I still need to reproduce it. Do you have the NVIDIA HPC compiler suite installed?

dutta-alankar commented 1 year ago

I have now tested CUDA 11.6, and it seems that GCC 11.3.0 is incompatible with it and produces the error I mentioned earlier. I compiled GCC 9.5.0 locally, installed it in a non-standard location, and updated PATH and LD_LIBRARY_PATH in my .bashrc. `which g++` then pointed to the 9.5.0 install, but cmake was still detecting 11.3.0. So I ran cmake as follows:

CXX=g++ cmake $IDEFIX_DIR -DKokkos_ENABLE_CUDA=ON

With this, cmake detected GCC 9.5.0 when generating the Makefile. make then ran smoothly up to 100%, after which I hit another error:

nvlink fatal   : Could not open input file '/usr/lib/x86_64-linux-gnu/libdl.a'
make[2]: *** [CMakeFiles/idefix.dir/build.make:790: idefix] Error 1
make[1]: *** [CMakeFiles/Makefile2:417: CMakeFiles/idefix.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

After searching a bit, this seems to be an issue with CUDA 11.6 that was fixed in later CUDA releases, although I couldn't figure out the cause. See here:

I then switched to CUDA 11.7, and only then was I able to compile Idefix. So the takeaway is that GCC 9.5.0 with CUDA 11.7 is a working combination. @glesur, let me know if you can reproduce any of these issues, or if you know which combinations of CUDA and compiler versions are known to work.
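In the comment above, `which g++` pointed at the locally built GCC 9.5.0 while CMake still reported 11.3.0, and prefixing the command with `CXX=g++` changed what was detected. A small check that prints what a bare compiler name resolves to on the current `PATH`, together with its version, makes that kind of mismatch visible before configuring. A minimal sketch; `host_gcc_version` is a hypothetical helper, and it assumes a GCC-style compiler that understands `-dumpfullversion` (GCC 7 and later) or `-dumpversion`:

```shell
#!/usr/bin/env bash
# Sketch: show which compiler a bare name resolves to on the current PATH,
# and its version, to compare against what CMake prints when it configures.
# host_gcc_version is a hypothetical helper.

# Usage: host_gcc_version [COMPILER]  -> prints "<absolute path> <version>"
host_gcc_version() {
  local cc=${1:-g++} path
  path=$(command -v "$cc") || { echo "$cc: not found on PATH" >&2; return 1; }
  # -dumpfullversion exists in GCC >= 7; fall back to -dumpversion otherwise.
  echo "$path $("$path" -dumpfullversion 2>/dev/null || "$path" -dumpversion)"
}
```

For example, `host_gcc_version g++` should print the path and version of the 9.5.0 install once `PATH` is updated. Passing that absolute path, e.g. `CXX=$(command -v g++)`, may be a safer choice than a bare `g++`, since it pins the exact binary.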

glesur commented 1 year ago

I know that CUDA 11.6 with gcc 9 works, but I've never tried gcc 11. I'll double-check this, thanks.

dutta-alankar commented 1 year ago

@glesur Thanks for your input. I updated Kokkos by running `git pull origin master` inside `src/kokkos`, and now CUDA 12.0 with GCC 11.3.0 compiles Idefix.

UPDATE: With the updated Kokkos, CUDA 12.0 works with GCC versions up to 12.2.0 and Idefix compiles. GCC 13.x does not work.

glesur commented 1 year ago

To conclude on CUDA 11.6: the issue you encountered is a known incompatibility between gcc 11 and CUDA 11.6, which was fixed in CUDA 11.6.2. Possible workarounds are using gcc 12 or CUDA 11.6.2.

dutta-alankar commented 1 year ago

I'm not sure gcc 12 and CUDA 11.6.2 are a compatible combination: https://gist.github.com/ax3l/9489132 Anyway, I have been able to get this working with Kokkos 4, CUDA 12.0 and GCC 12.2.0, and also with the detached HEAD of Kokkos at a1d045d with CUDA 11.8 and GCC 9.5.0. That GitHub gist is probably a useful reference for anyone new who wants to compile and run Idefix on a GPU. @glesur Thanks again for your help. You may close this issue, as it seems to be addressed.
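The combinations reported in this thread can be collected into a small lookup, which may help the next person compare against their own versions. This is a sketch that encodes only what the participants above reported, not an official compatibility matrix; NVIDIA's documentation and the gist linked above are better sources. `reported_status` is a hypothetical helper, and because it keys on CUDA `major.minor` it cannot tell CUDA 11.6.2 apart from earlier 11.6 updates. Note also that CUDA 12.0 with the Kokkos version originally pinned by Idefix failed at configure time regardless of GCC.

```shell
#!/usr/bin/env bash
# Sketch: CUDA/GCC combinations reported in this issue thread, as a lookup.
# NOT an official compatibility matrix. reported_status is hypothetical.

# Usage: reported_status CUDA_MAJOR.MINOR GCC_VERSION   (e.g. 12.0 11.3.0)
reported_status() {
  case "$1:$2" in
    12.0:11.3.0|12.0:12.2.0) echo "worked (after updating Kokkos to 4)" ;;
    12.0:13.*)               echo "failed (GCC 13.x, even with updated Kokkos)" ;;
    11.8:9.5.0)              echo "worked (Kokkos at a1d045d)" ;;
    11.7:9.5.0)              echo "worked" ;;
    11.6:11.*)               echo "failed (gcc 11 issue, reported fixed in CUDA 11.6.2)" ;;
    11.6:9.*)                echo "mixed (reported working; nvlink libdl.a error here)" ;;
    *)                       echo "not reported in this thread" ;;
  esac
}
```

For example, `reported_status 11.7 9.5.0` prints `worked`, matching the combination reported earlier in the thread.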