spcl / dace

DaCe - Data Centric Parallel Programming
http://dace.is/fast
BSD 3-Clause "New" or "Revised" License

Cmake error for distributed polybench.py #1130

Open · Mittagskogel opened this issue 2 years ago

Mittagskogel commented 2 years ago

Running samples/distributed/polybench.py causes CMake to fail (and the program subsequently segfaults). The problem seems to occur in compilers.py when calling the env.cmake_compile_flags function at line 297.

Steps to reproduce the problem:

git clone --recursive https://github.com/spcl/dace.git 
virtualenv -p python dace_env
source ./dace_env/bin/activate
cd dace
ml gcc-11.3.0-gcc-11.3.0-rkggaw intel-mkl-2020.4.304-gcc-11.3.0-xxtaniu py-mpi4py-3.1.2-gcc-11.3.0-openmpi-u3yz3iy openmpi-4.1.4-gcc-11.3.0-rovfonw cmake-3.24.2-gcc-11.3.0-2y45yxw
pip install --editable .
python ./samples/distributed/polybench.py

Output for python ./samples/distributed/polybench.py:

===== atax =====
sizes: [20000, 25000]
adjusted sizes: (20000, 25000)
data initialized
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found MPI_C: /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/openmpi-4.1.4-rovfonwrs3dc4tpklzyfrjvthhlc6ze5/lib/libmpi.so (found version "3.1") 
-- Found MPI_CXX: /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/openmpi-4.1.4-rovfonwrs3dc4tpklzyfrjvthhlc6ze5/lib/libmpi.so (found version "3.1") 
-- Found MPI: TRUE (found version "3.1")  
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/tmp3iuvu06n/build
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/tmp3iuvu06n/build
CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.

CMake Error: Generator implementation error, all generators must specify this->FindMakeProgramFile
 -Wl,-rpath,/cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/lib/gcc/x86_64-pc-linux-gnu/11.3.0 -Wl,-rpath,/cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/lib64 -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/openmpi-4.1.4-rovfonwrs3dc4tpklzyfrjvthhlc6ze5/lib -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/hwloc-2.8.0-xda3dbqr64n3rm7ths6ofk3asz2uiyog/lib -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/libevent-2.1.12-cwyznrpoi6aozhji3z7qa55gmwe2sufm/lib -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/pmix-4.1.2-jzw7xx4njmyz3g7h3i6qblhycwxbhs3u/lib -Wl,-rpath -Wl,/cm/shared/apps/slurm/21.08.8/lib64 -L/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/hwloc-2.8.0-xda3dbqr64n3rm7ths6ofk3asz2uiyog/lib -L/usr/lib64 -L/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/libevent-2.1.12-cwyznrpoi6aozhji3z7qa55gmwe2sufm/lib -L/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/pmix-4.1.2-jzw7xx4njmyz3g7h3i6qblhycwxbhs3u/lib -L/cm/shared/apps/slurm/21.08.8/lib64 -pthread -L /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/openmpi-4.1.4-rovfonwrs3dc4tpklzyfrjvthhlc6ze5/lib -lmpi
CMake Error: The source directory "/tmp/tmp3iuvu06n/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
[racklette1:2443130:0:2443130] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)
BFD: Dwarf Error: Can't find .debug_ranges section.
[previous line repeated 54 times in total]
==== backtrace (tid:2443130) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000000a4cd7 MPI_Comm_size()  ???:0
 2 0x0000000000029c59 MKLMPI_Comm_size()  ???:0
 3 0x0000000000027f51 mkl_blacs_init()  ???:0
 4 0x0000000000018ab8 blacs_pinfo_()  ???:0
 5 0x00000000000017c5 __dace_init_atax()  ???:0
 6 0x0000000000009052 ffi_call_unix64()  :0
 7 0x0000000000007ebb ffi_call_int()  ffi64.c:0
 8 0x0000000000012575 _ctypes_callproc()  :0
 9 0x000000000000c550 PyCFuncPtr_call()  _ctypes.c:0
10 0x00000000000c04b4 _PyObject_Call()  ???:0
11 0x000000000006f1bb _PyEval_EvalFrameDefault()  ???:0
12 0x0000000000069609 function_code_fastcall()  call.c:0
13 0x0000000000071442 _PyEval_EvalFrameDefault()  ???:0
14 0x00000000001ad2ec _PyEval_EvalCode()  :0
15 0x00000000000c0621 _PyFunction_Vectorcall()  ???:0
16 0x00000000000c0a9e _PyObject_FastCallDictTstate()  ???:0
17 0x00000000000c0cf0 _PyObject_Call_Prepend()  ???:0
18 0x0000000000125de8 slot_tp_call()  typeobject.c:0
19 0x00000000000c08b0 _PyObject_MakeTpCall()  ???:0
20 0x0000000000071e00 _PyEval_EvalFrameDefault()  ???:0
21 0x00000000001ad2ec _PyEval_EvalCode()  :0
22 0x00000000000c0621 _PyFunction_Vectorcall()  ???:0
23 0x0000000000070d76 _PyEval_EvalFrameDefault()  ???:0
24 0x00000000001ad2ec _PyEval_EvalCode()  :0
25 0x00000000001ad7de _PyEval_EvalCodeWithName()  ???:0
26 0x00000000001ad82b PyEval_EvalCodeEx()  ???:0
27 0x00000000001ad85b PyEval_EvalCode()  ???:0
28 0x00000000001eee3e run_mod()  pythonrun.c:0
29 0x00000000001f0901 PyRun_SimpleFileExFlags()  ???:0
30 0x000000000020e8cf Py_RunMain()  ???:0
31 0x000000000020ed97 Py_BytesMain()  ???:0
32 0x000000000003acf3 __libc_start_main()  ???:0
33 0x0000000000400f3e _start()  ???:0
=================================
./install_dace.sh: line 7: 2443130 Segmentation fault      (core dumped) python ./samples/distributed/polybench.py
alexnick83 commented 2 years ago

There is a similar issue here. This may be a configuration problem regarding the path to the .dacecache folder. Have you tried running something simple, e.g., samples/simple/axpy.py?

Also, I suggest not using the MKL+OpenMPI combination; if you want to try it anyway, you will need to set the PBLAS default library implementation accordingly.
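
A minimal sketch of how setting this might look, assuming DaCe's usual per-library default_implementation mechanism; the module path and the implementation names ('MKLMPICH', 'MKLOpenMPI') are assumptions and may differ in your DaCe version, so check dace/libraries/pblas for the exact identifiers:

from dace.libraries import pblas

# Select the PBLAS expansion matching the MPI stack in use,
# e.g. MKL built against MPICH (implementation name assumed):
pblas.default_implementation = 'MKLMPICH'

This would go near the top of the sample script, before the SDFG is compiled.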

Mittagskogel commented 2 years ago

axpy.py seems to work fine:

(dace_env) [user@cluster dace]$ python ./samples/simple/axpy.py 
Difference: 0.0

I've now switched to MKL+MPICH, but I don't think this is relevant to the issue. Also, setting the default_build_folder configuration option doesn't change anything:

(dace_env) [user@cluster dace]$ export DACE_default_build_folder=.dacecache
(dace_env) [user@cluster dace]$ python ./samples/distributed/polybench.py 
===== atax =====
sizes: [20000, 25000]
adjusted sizes: (20000, 25000)
data initialized
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /cm/shared/apps/spack-stack/linux-rocky8-zen2/gcc-11.3.0/gcc-11.3.0-rkggaw2lju22imfhv77nqtu6uhcgyizv/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found MPI_C: /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib/libmpi.so (found version "4.0") 
-- Found MPI_CXX: /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib/libmpicxx.so (found version "4.0") 
-- Found MPI: TRUE (found version "4.0")  
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/tmpsobb3aj9/build
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/tmpsobb3aj9/build
CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.

CMake Error: Generator implementation error, all generators must specify this->FindMakeProgramFile
 -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -L /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -lmpicxx -L /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -lmpi
CMake Error: The source directory "/tmp/tmpsobb3aj9/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
Abort(605670927): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(59)....: MPI_Init(argc=0x7fffffff8cd0, argv=0x7fffffff8cc8) failed
MPII_Init_thread(209): 
MPID_Init(359).......: 
MPIR_pmi_init(141)...: PMI2_Job_GetId returned 14

I believe this is because CMake isn't failing during an actual build step; it's failing while trying to detect the linker flags, so perhaps this configuration option doesn't apply.

alexnick83 commented 2 years ago

That makes sense, then. Can you check that you are using an mpi4py build that is compatible with MPICH? If you are still using the one from above (py-mpi4py-3.1.2-gcc-11.3.0-openmpi-u3yz3iy), then this is the most likely cause of the error.
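
A quick way to check which MPI library an mpi4py build is actually using, via standard mpi4py calls:

import mpi4py
from mpi4py import MPI

# Reports the MPI implementation linked at runtime (e.g. "MPICH ..." vs. "Open MPI ...").
print(MPI.Get_library_version())
# Shows the build-time configuration, including the MPI compiler wrappers used.
print(mpi4py.get_config())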

The issue with MKL+OpenMPI is that it works only with Intel's static BLACS library (see here), which may not even be installed on your system.

Mittagskogel commented 2 years ago

Indeed, I've finally gotten the program to run with MKL+MPICH. As you suggested, the py-mpi4py module was still the wrong one. So the CMake error message

CMake Error: Generator implementation error, all generators must specify this->FindMakeProgramFile
 -Wl,-rpath -Wl,/home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -L /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -lmpicxx -L /home/user/spack/opt/spack/linux-rocky8-zen2/gcc-11.3.0/mpich-4.0.2-j3plqofcp37hfnmsnd3brbszqmhgjppu/lib -lmpi
CMake Error: The source directory "/tmp/tmpv_4lp_8g/build" does not exist.
Specify --help for usage, or press the help button on the CMake GUI.

seems to be a huge red herring; we've been trying to fix the wrong problem all along. How can we force CMake to give more verbose output about linking errors in this step?
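
For reference, a few standard CMake options make this configure step more verbose; the sketch below is a generic illustration driven from Python, and the probe directory path is hypothetical rather than something DaCe exposes:

import subprocess

# Hypothetical location of the temporary flag-detection project.
probe_dir = "/tmp/mpi_flag_probe"

subprocess.run([
    "cmake",
    "--debug-output",      # print extra information while configuring
    "--debug-trycompile",  # keep try_compile() temporaries for inspection
    "--trace-expand",      # trace each CMake command with variables expanded
    "-S", probe_dir,              # source directory containing CMakeLists.txt
    "-B", probe_dir + "/build",   # separate build directory
], check=False)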

alexnick83 commented 2 years ago

I understand what is going on now. We have a separate CMake script that creates a pseudo-project in a tmp directory to detect the correct paths and names of the MPI libraries, e.g., on Cray machines. It seems the script is not complete, which is probably the reason for these errors. Does the program run now (regardless of whether the error still appears)? Could you tell me your CMake version, so I can try to reproduce and fix it?
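
For illustration, a sketch of the idea described above: a throwaway CMake project whose only purpose is to let find_package(MPI) report the detected link information. This is an assumption-laden mock-up of the approach, not DaCe's actual compilers.py code:

import subprocess
import tempfile
from pathlib import Path

# Minimal probe project: find MPI and print the detected link information.
PROBE_CMAKELISTS = """
cmake_minimum_required(VERSION 3.15)
project(mpi_flag_probe C CXX)
find_package(MPI REQUIRED)
message(STATUS "MPI_CXX_LIBRARIES=${MPI_CXX_LIBRARIES}")
message(STATUS "MPI_CXX_LINK_FLAGS=${MPI_CXX_LINK_FLAGS}")
"""

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "CMakeLists.txt").write_text(PROBE_CMAKELISTS)
    build = src / "build"
    # Passing the source and build directories explicitly avoids the
    # "No source or binary directory provided" warning seen in the logs above.
    result = subprocess.run(["cmake", "-S", str(src), "-B", str(build)],
                            capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)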

Mittagskogel commented 2 years ago

Yes, the distributed benchmark is running regardless of the CMake errors. I'm using CMake 3.24.2, but I've tested multiple versions and they all triggered the error.

sumuzhe317 commented 2 years ago

We seem to have the same problem, even though I can confirm that the correct default build folder is set. In addition, the .dacecache folder is created correctly in the current directory and contains the atax subdirectory. I have also tried to replicate this on another machine: there, when I use MPICH, the CMake files are still written to the tmp directory, but a complete polybench run finishes correctly.

alexnick83 commented 2 years ago

We will fix this issue. In the meantime, if you are participating in the Student Cluster Competition, I made a post that describes the issue in more detail, in case you want or need to make any amendments to the relevant files.

tbennun commented 2 weeks ago

@alexnick83 Was this fixed?