mir-group / phoebe

A high-performance framework for solving phonon and electron Boltzmann equations
https://mir-group.github.io/phoebe/
MIT License

Compile Error on CUDA 12.0 #184

Closed Youhaojen closed 1 year ago

Youhaojen commented 1 year ago

Dear Phoebe Developers,

I am writing to inform you that I encountered some difficulties while attempting to compile Phoebe using NVHPC 23.3 and CUDA 12.0 on an RTX 3090. Unfortunately, an error occurred during the process.

After investigating the issue, I suspect that the error may be caused by the fact that the nvcc version is too new for Phoebe to handle.

I kindly request that the Phoebe development team consider supporting higher versions of CUDA, as this would greatly benefit users who require newer hardware.

Thank you for your time and attention to this matter.

Hao-Jen You

Command line: cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON

Error:

-- The CXX compiler identification is unknown
-- The Fortran compiler identification is NVHPC 23.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /home/danken/File/phoebe/lib/kokkos/bin/nvcc_wrapper
-- Check for working CXX compiler: /home/danken/File/phoebe/lib/kokkos/bin/nvcc_wrapper - broken
CMake Error at /usr/share/cmake-3.25/Modules/CMakeTestCXXCompiler.cmake:63 (message):
  The C++ compiler

    "/home/danken/File/phoebe/lib/kokkos/bin/nvcc_wrapper"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /home/danken/File/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-CpKoQ5

    Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_d601a/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_d601a.dir/build.make CMakeFiles/cmTC_d601a.dir/build
    gmake[1]: Entering directory '/home/danken/File/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-CpKoQ5'
    Building CXX object CMakeFiles/cmTC_d601a.dir/testCXXCompiler.cxx.o
    /home/danken/File/phoebe/lib/kokkos/bin/nvcc_wrapper    -o CMakeFiles/cmTC_d601a.dir/testCXXCompiler.cxx.o -c /home/danken/File/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-CpKoQ5/testCXXCompiler.cxx
    nvcc fatal   : Value 'sm_35' is not defined for option 'gpu-architecture'
    gmake[1]: *** [CMakeFiles/cmTC_d601a.dir/build.make:78: CMakeFiles/cmTC_d601a.dir/testCXXCompiler.cxx.o] Error 1
    gmake[1]: Leaving directory '/home/danken/File/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-CpKoQ5'
    gmake: *** [Makefile:127: cmTC_d601a/fast] Error 2

Youhaojen commented 1 year ago

The error has been fixed via a modified nvcc_wrapper file.

In nvcc_wrapper, I changed default_arch="sm_35" to default_arch="sm_86".
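The same edit can be scripted rather than made by hand. The following is only a sketch: the function name set_default_arch is hypothetical, and it assumes the wrapper contains an active (uncommented) line of the form default_arch="sm_35", as quoted above. It edits the file in place and leaves a .bak backup next to it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kokkos or Phoebe): rewrite the active
# default_arch assignment in an nvcc_wrapper script.
# Usage: set_default_arch <path-to-nvcc_wrapper> <arch, e.g. sm_86>
set_default_arch() {
    local wrapper="$1" arch="$2"
    # Stop early if the file has no active default_arch line (assumption:
    # the assignment starts at the beginning of a line).
    if ! grep -q '^default_arch=' "$wrapper"; then
        echo "no default_arch= line found in $wrapper" >&2
        return 1
    fi
    # Replace only the uncommented assignment; commented lines are untouched.
    sed -i.bak -E "s/^default_arch=.*/default_arch=\"${arch}\"/" "$wrapper"
}
```

With the paths from the log above, this would be called as set_default_arch lib/kokkos/bin/nvcc_wrapper sm_86 from the Phoebe source directory. The value sm_86 matches the RTX 3090 used here; other GPUs need a different value.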

I then ran cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON again, and a new error occurred.

The following is from CMakeError.log:

Determining if the Fortran sgemm exists failed with the following output:
Change Dir: /home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-9SihMK

Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_1ef58/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_1ef58.dir/build.make CMakeFiles/cmTC_1ef58.dir/build
gmake[1]: Entering directory '/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-9SihMK'
Building Fortran object CMakeFiles/cmTC_1ef58.dir/testFortranCompiler.f.o
/application/compiler/nvidia/hpc_sdk-23.3/Linux_x86_64/23.3/compilers/bin/nvfortran    -c /home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-9SihMK/testFortranCompiler.f -o CMakeFiles/cmTC_1ef58.dir/testFortranCompiler.f.o
Linking Fortran executable cmTC_1ef58
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_1ef58.dir/link.txt --verbose=1
/application/compiler/nvidia/hpc_sdk-23.3/Linux_x86_64/23.3/compilers/bin/nvfortran CMakeFiles/cmTC_1ef58.dir/testFortranCompiler.f.o -o cmTC_1ef58 
/usr/bin/ld: CMakeFiles/cmTC_1ef58.dir/testFortranCompiler.f.o: in function `MAIN_':
/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-9SihMK/testFortranCompiler.f:4: undefined reference to `sgemm_'
gmake[1]: *** [CMakeFiles/cmTC_1ef58.dir/build.make:99: cmTC_1ef58] Error 2
gmake[1]: Leaving directory '/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-9SihMK'
gmake: *** [Makefile:127: cmTC_1ef58/fast] Error 2

Determining if the Fortran cheev exists failed with the following output:
Change Dir: /home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-fYhobq

Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_bd8dc/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_bd8dc.dir/build.make CMakeFiles/cmTC_bd8dc.dir/build
gmake[1]: Entering directory '/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-fYhobq'
Building Fortran object CMakeFiles/cmTC_bd8dc.dir/testFortranCompiler.f.o
/application/compiler/nvidia/hpc_sdk-23.3/Linux_x86_64/23.3/compilers/bin/nvfortran    -c /home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-fYhobq/testFortranCompiler.f -o CMakeFiles/cmTC_bd8dc.dir/testFortranCompiler.f.o
Linking Fortran executable cmTC_bd8dc
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_bd8dc.dir/link.txt --verbose=1
/application/compiler/nvidia/hpc_sdk-23.3/Linux_x86_64/23.3/compilers/bin/nvfortran CMakeFiles/cmTC_bd8dc.dir/testFortranCompiler.f.o -o cmTC_bd8dc  /usr/lib/x86_64-linux-gnu/libblas.so -lm -ldl 
/usr/bin/ld: CMakeFiles/cmTC_bd8dc.dir/testFortranCompiler.f.o: in function `MAIN_':
/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-fYhobq/testFortranCompiler.f:4: undefined reference to `cheev_'
gmake[1]: *** [CMakeFiles/cmTC_bd8dc.dir/build.make:100: cmTC_bd8dc] Error 2
gmake[1]: Leaving directory '/home/danken/Downloads/phoebe/build/CMakeFiles/CMakeScratch/TryCompile-fYhobq'
gmake: *** [Makefile:127: cmTC_bd8dc/fast] Error 2
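
Logs like this are long, so it can help to pull out just the missing symbols. The sketch below assumes the linker messages have the GNU ld form shown above (undefined reference to `name'); the function name missing_symbols is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique symbols that a linker log reports as
# "undefined reference to `name'" (GNU ld message format assumed).
missing_symbols() {
    grep -o "undefined reference to \`[^']*'" "$1" \
        | sed -E "s/.*\`([^']*)'/\1/" \
        | sort -u
}
```

Run on the excerpt above, missing_symbols CMakeError.log prints cheev_ and sgemm_. Note that the sgemm probe links no library at all and the cheev probe links only libblas.so, while cheev is a LAPACK routine, so these particular failures may simply be CMake trying candidate libraries in turn rather than the fatal error.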

jcoulter12 commented 1 year ago

Hi Hao-Jen You,

Thanks very much for writing to us about this. I'm glad you were able to get through the first part. We're aware there can be some difficulty when using the nvcc compilers. As a result, we've mostly recommended that people use the gcc or Intel compilers. If you have one of these, you may get things working faster by specifying them.

Though this is something I would like to fix, I actually don't have access to a system with nvcc, so perhaps we can work together to figure it out.

Would you mind sharing the output of your cmake run as a text file attachment? I want to see if cmake is picking up the wrong things somehow. I suppose right now, superficially, it looks like somehow you don't have an appropriate Fortran compiler.

Alternatively, the line

Determining if the Fortran sgemm exists failed with the following output:
Change Dir:  ...

may imply that you don't have permission to build the code in the directory you're using. I'm just guessing based on this small snippet of information.

Thanks, and let's see if we can figure this out. Jenny Coulter

Youhaojen commented 1 year ago

Dear Jenny Coulter,

Thank you for your prompt response. I appreciate your willingness to work together on resolving this issue.

I am interested in using OpenACC in my VASP experiment, as it has been shown to improve performance. However, I encountered an error while CMake was determining whether the Fortran sgemm exists, which seems to be related to the BLAS library. Perhaps I need to resolve this by compiling OpenBLAS with NVHPC?

I have included two log files generated from running cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON for your reference: CMakeOutput.log CMakeError.log

Additionally, I tried defining the compiler and running cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON -DCMAKE_C_COMPILER=nvcc -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_Fortran_COMPILER=nvfortran, but the error seems to persist. I have attached three log files for your reference: compile.log CMakeOutput(1).log CMakeError(1).log

Best regards, Hao-Jen You

jcoulter12 commented 1 year ago

Hi, this is a bit confusing to me:

> I am interested in using OpenACC in my VASP experiment, as it has been shown to improve performance. However, I encountered an error while CMake was determining whether the Fortran sgemm exists, which seems to be related to the BLAS library. Perhaps I need to resolve this by compiling OpenBLAS with NVHPC?

You mean Phoebe, not VASP, right? :) Also, we don't use OpenACC; we use Kokkos, which can be compiled for GPU or CPU support. We have found that the easiest way to build the code with GPU support is to use the gcc or Intel compilers and specify:

cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON

This will produce an executable that can use GPUs. If you have these compilers available, I would suggest this option. It will take advantage of your available CUDA version.

While these error messages are useful, I was interested in what cmake actually prints out when you run it. Could you send me this?

Best, Jenny

Youhaojen commented 1 year ago

Thanks for your reply.

> You mean Phoebe not vasp, right? :)

Yes

> Also, we don't use OpenACC; we use Kokkos, which can be compiled for GPU or CPU support. We have found that the easiest way to build the code with GPU support is to use the gcc or Intel compilers and specify:
>
> cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON
>
> This will produce an executable that can use GPUs. If you have these compilers available, I would suggest this option. It will take advantage of your available CUDA version.

After investigating, I found a few errors in my Phoebe build. I realized that during my previous attempt I had run module load NVHPC. The relevant information is on lines 1 and 3 of the corresponding compile.log:

-- The C compiler identification is NVHPC 23.3.0
-- The CXX compiler identification is GNU 12.2.0
-- The Fortran compiler identification is NVHPC 23.3.0

It appears that GCC and gfortran were not being used, so I disabled the NVHPC environment by running module unload NVHPC. I also modified my environment in the hope of resolving the errors. Specifically, I installed several packages with the following commands:

sudo apt install cmake gcc doxygen graphviz libomp-dev libopenmpi3 libhdf5-openmpi-dev
sudo apt install nvidia-cuda-toolkit

I would like to share some details about the current environment, which are outlined below:

Ubuntu 23.04
The C compiler identification is GNU 12.2.0
The CXX compiler identification is GNU 11.3.0
The Fortran compiler identification is GNU 12.2.0
CUDA version 11.8.89

To test whether the CPU version compiles, I ran:

cmake .. -DKokkos_ENABLE_OPENMP=ON -DOMP_AVAIL=ON -DCMAKE_CXX_STANDARD_LIBRARIES="-L/usr/lib/x86_64-linux-gnu/hdf5/openmpi/" -DCMAKE_CXX_FLAGS="-I/usr/include/hdf5/openmpi/"

I am pleased to report that Phoebe ran successfully. However, CMake did report some errors. Attached logs: cmake.log, CMakeOutput.log, CMakeError.log, make.log

Following the successful CPU build, I attempted to compile the GPU version of Phoebe by running:

cmake .. -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON -DOMP_AVAIL=ON -DCMAKE_CXX_STANDARD_LIBRARIES="-L/usr/lib/x86_64-linux-gnu/hdf5/openmpi/" -DCMAKE_CXX_FLAGS="-I/usr/include/hdf5/openmpi/"

Regrettably, the compilation was unsuccessful. Attached logs: cmake.log, CMakeOutput.log, CMakeError.log, make.log

By the way, it is worth mentioning that the error present in the CMakeError.log files for both the CPU and GPU versions of Phoebe is identical.

Best regards, Hao-Jen You

jcoulter12 commented 1 year ago

Hi @Youhaojen,

Thanks very much, these details are closer to what I was looking for. Indeed, I wanted what you have in compile.log to look at exactly the lines you supplied, so I appreciate this update.

> -- The C compiler identification is GNU 12.2.0
> -- The CXX compiler identification is GNU 11.3.0
> -- The Fortran compiler identification is GNU 12.2.0

In your GPU build, it seems you again have mismatched compilers. Do you know the root of this? That could certainly cause issues, as whatever OpenMPI, HDF5, etc. you load need to match the compiler. Is this a SLURM-based cluster? I am wondering if CMake is finding system copies of some of the GNU compilers and mixing them with some of the copies you've installed. Also, if it's a cluster, is there public documentation? That could help me figure things out.
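
A mismatch like this can be spotted automatically from the saved CMake output. The sketch below is an illustration only: check_compiler_ids is a hypothetical name, and it assumes the configure output was saved to a file and contains identification lines in exactly the format quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read saved CMake configure output and report whether
# the C, CXX, and Fortran compilers share one vendor and version.
# Returns 0 if consistent, 1 on mismatch, 2 if no identification lines exist.
check_compiler_ids() {
    local ids count
    # Keep only "<vendor> <version>" from each identification line.
    ids=$(sed -n -E 's/^-- The (C|CXX|Fortran) compiler identification is (.*)$/\2/p' "$1" | sort -u)
    if [ -z "$ids" ]; then
        echo "no compiler identification lines found"
        return 2
    fi
    count=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
    if [ "$count" -eq 1 ]; then
        echo "consistent: $ids"
    else
        echo "mismatch:"
        printf '%s\n' "$ids"
        return 1
    fi
}
```

For the three lines quoted above, the function reports a mismatch between GNU 11.3.0 and GNU 12.2.0. One common way to remove such a mismatch is to pass -DCMAKE_C_COMPILER, -DCMAKE_CXX_COMPILER, and -DCMAKE_Fortran_COMPILER explicitly so that all three point at the same toolchain.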

Ubuntu is a bit quirky about parallel HDF5. I think you've seen the note on our Install page, since you've already added these CXX library lines.

Best, Jenny

Youhaojen commented 1 year ago

Hi Jenny,

I wanted to express my gratitude for your assistance. I'm happy to inform you that the GPU version of Phoebe has finally been successfully compiled and is now working perfectly.

> -- The C compiler identification is GNU 12.2.0
> -- The CXX compiler identification is GNU 11.3.0
> -- The Fortran compiler identification is GNU 12.2.0
>
> In your GPU build, it seems you again have mismatched compilers. Do you know the root of this? That could certainly cause issues, as whatever OpenMPI, HDF5, etc. you load need to match the compiler. Is this a SLURM-based cluster? I am wondering if CMake is finding system copies of some of the GNU compilers and mixing them with some of the copies you've installed. Also, if it's a cluster, is there public documentation? That could help me figure things out.

I compiled Phoebe on my personal computer. The mismatched compilers turned out not to be the main issue. Instead, the last line (line 5) of the previous make.log for the GPU version contained this error: /usr/bin/ld: cannot find -lgfortran: No such file or directory. The linker could not locate the libgfortran library. To address this, I followed the step from your installation tutorial for macOS and set the path with export LIBRARY_PATH=$LIBRARY_PATH:/path/to/libgfortran/. After that, the GPU version of Phoebe compiled without errors. The log files for both the CPU and GPU versions are attached below:

cmake-cpu.txt compile-cpu.txt cmake-gpu.txt compile-gpu.txt
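
The LIBRARY_PATH step can be sketched as a small script. This is an illustration under assumptions: add_to_library_path is a hypothetical helper name, and the script relies on the GCC driver option -print-file-name, which prints the full path of a library it can find and echoes the bare name back otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a directory to LIBRARY_PATH unless it is
# already listed, without creating empty or duplicate entries.
add_to_library_path() {
    local dir="$1"
    case ":${LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;  # already present, nothing to do
        *) export LIBRARY_PATH="${LIBRARY_PATH:+${LIBRARY_PATH}:}${dir}" ;;
    esac
}

# Ask gfortran where its runtime library lives, if gfortran is installed.
if command -v gfortran >/dev/null 2>&1; then
    libfile=$(gfortran -print-file-name=libgfortran.so)
    # A bare file name means gfortran could not locate the library.
    if [ "$libfile" != "libgfortran.so" ]; then
        add_to_library_path "$(dirname "$libfile")"
    fi
fi
```

Because export only affects the current shell, the script would need to be sourced (for example, source set_libgfortran.sh, a hypothetical file name) before running cmake and make in that same session.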

Furthermore, I would like to provide some suggestions for the installation tutorials:

By the way, I was wondering if there might be a possibility to calculate four-phonon interactions using Phoebe in the future.

Lastly, I want to express my sincere appreciation for your invaluable assistance and outstanding work. The GPU version of Phoebe is truly exceptional and remarkably fast.

Best regards, Hao-Jen You

jcoulter12 commented 1 year ago

Hi @Youhaojen,

Yes, this is definitely important when compiling on Apple computers, though I'm surprised it was still relevant if you are running Ubuntu.

Thanks for writing your suggestions, though there are a few reasons we have done things the way we have.

> For CPU version users, it is important to ensure that the environment includes the following packages: sudo apt install cmake gcc doxygen graphviz libomp-dev libopenmpi3 libhdf5-openmpi-dev.

Yes, we do list all of these as dependencies. However, most people compile Phoebe on computing clusters, where sudo isn't an option, so this command won't work for them. I can add it as a note for those building on Ubuntu.

Unfortunately, only Python codes can be built in Python environments, and Phoebe is a C++ code, so this isn't possible.

Thanks also for registering your interest in four-phonon processes. We've heard this from a few people, and it's very helpful to know what users want. This implementation could be a fair bit of work, but we're considering ways we might be able to do it.

We appreciate user feedback, and I'm very glad you were able to get things working! Let us know if you have any other questions.

Best, Jenny