nv-legate / legate.core

The Foundation for All Legate Libraries
https://docs.nvidia.com/legate/24.06/
Apache License 2.0

Build from source on PowerPC 9: no cuda-version package #905

Status: Closed (CharlelieLrt closed this issue 10 months ago)

CharlelieLrt commented 11 months ago

I am trying to build legate from source on Lassen (PowerPC9, OS: RHEL 7.9 Maipo) following instructions in the quickstart.

I generate a config file for my conda environment with ./scripts/generate-conda-envs.py --python 3.10 --ctk 11.8 --os linux. The config file is:

name: legate-test
channels:
  - conda-forge
  - nvidia
dependencies:

  - python=3.10,!=3.9.7  # avoid https://bugs.python.org/issue45121

  # cuda
  - cuda-version=11.8
  - cutensor>=1.3.3,<2
  - nccl
  - pynvml

  # build
  - cmake>=3.24,!=3.25.0
  - cython
  - elfutils
  - git
  - make
  - ninja
  - numba
  - openssl
  - pkg-config
  - rust
  - scikit-build>=0.13.1
  - setuptools>=60
  - zlib

  # runtime
  - cffi
  - llvm-openmp
  - numpy>=1.22
  - libblas=*=*openblas*
  - openblas=*=*openmp*
  - openblas<=0.3.21
  - opt_einsum
  - scipy
  - typing_extensions

  # tests
  - clang-tools>=8
  - clang>=8
  - colorama
  - coverage
  - mock
  - mypy>=0.961
  - pre-commit
  - pytest-cov
  - pytest-lazy-fixture
  - pytest-mock
  - pytest
  - types-docutils
  - pynvml
  - tifffile

  # docs
  - pandoc
  - doxygen
  - ipython
  - jinja2
  - markdown<3.4.0
  - pydata-sphinx-theme>=0.13
  - myst-parser
  - nbsphinx
  - sphinx-copybutton
  - sphinx>=4.4.0

Trying to create the environment gives the following error:

ResolvePackageNotFound: 
  - pydata-sphinx-theme[version='>=0.13']
  - cuda-version=11.8

In addition, running conda search cuda-version -c nvidia -c conda-forge suggests that the cuda-version package does not exist in these channels.

manopapad commented 11 months ago

@CharlelieLrt can you share the full output from the command, and also conda --version? Can you also try with mamba?

The missing packages are not under linux-ppc64le, but they are under noarch, so that should be sufficient for conda to find and use them, even if you're on a PowerPC platform. @m3vaz any idea what might have happened here?
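To check the point about subdirs, one could query each subdir explicitly rather than rely on the default platform. The following is a minimal sketch that only builds and prints the commands. It assumes a conda version whose `search` subcommand accepts `--subdir`, and the helper name `search_commands` is hypothetical.

```python
def search_commands(package, channels, subdirs):
    """Build one `conda search` command line per subdir.

    Assumption: the installed conda supports `conda search --subdir`.
    """
    commands = []
    for subdir in subdirs:
        cmd = ["conda", "search", package, "--subdir", subdir]
        for channel in channels:
            cmd += ["-c", channel]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Package, channels and subdirs are the ones discussed in this thread.
    for cmd in search_commands(
        "cuda-version",
        ["nvidia", "conda-forge"],
        ["linux-ppc64le", "noarch"],
    ):
        print(" ".join(cmd))
```

If the package shows up only under noarch, that would be consistent with the explanation above, and the failure is then more likely due to the old conda solver than to a missing package.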

CharlelieLrt commented 11 months ago

Version is conda 4.6.14. I switched to mamba, and it could find the cuda-version package. I could then install the dependencies (except the ones for the docs, which I won't need).

I am now trying to install legate with:

./install.py --cuda --arch volta --network gasnet1 --max-dim 5 --openmp --hdf5 --build-tests --build-examples --conduit ibv

but I get an error telling me that the version of cmake I am using is incompatible:

CMake Error at CMakeLists.txt:17 (cmake_minimum_required):
CMake 3.22.1 or higher is required.  You are running version 3.17.5

My PATH is:

/g/g92/laurent3/miniforge3/envs/legate_base/bin:...

So I looked at the cmake I have there: /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake --version shows cmake version 3.27.9. In contrast, cmake3 --version shows cmake3 version 3.17.5, which is installed in /usr/bin/cmake3. I therefore assume that install.py is using this system-wide cmake instead of the one in my conda environment. I tried passing the extra argument --with-cmake /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake to install.py, but it did not change anything.

I believe this was mentioned in #837
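The manual check described above (comparing what `cmake` and `cmake3` resolve to and which versions they report) can be scripted. This is a minimal diagnostic sketch, not part of install.py. The minimum version is taken from the error message, and the helper names are hypothetical.

```python
import re
import shutil
import subprocess

MIN_VERSION = (3, 22, 1)  # from the "CMake 3.22.1 or higher is required" error


def parse_cmake_version(text):
    """Extract (major, minor, patch) from `cmake --version` output, or None."""
    match = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def report(names=("cmake", "cmake3")):
    """Print where each executable name resolves on PATH and its version."""
    for name in names:
        path = shutil.which(name)
        if path is None:
            print(f"{name}: not found on PATH")
            continue
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=False
        )
        version = parse_cmake_version(result.stdout)
        ok = version is not None and version >= MIN_VERSION
        status = "ok" if ok else "too old or unknown"
        print(f"{name}: {path} -> {version} ({status})")


if __name__ == "__main__":
    report()
```

On the setup described here, one would expect `cmake` to resolve inside the conda environment and `cmake3` to resolve to /usr/bin/cmake3, which would explain the error if the build looks up the `cmake3` name first.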

manopapad commented 10 months ago

I pushed a fix here, could you please try that? https://github.com/nv-legate/legate.core/pull/908

CharlelieLrt commented 10 months ago

It did not solve the problem. Now I see:

[...]
conduit: ibv
gasnet_system: None
nccl_dir: None
cmake_exe: /g/g92/laurent3/miniforge3/envs/legate_base/bin/cmake
cmake_generator: Ninja
[...]

But later on:

  Configuring Project
    Working directory:
      /usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build
    Command:
      /usr/bin/cmake3 /usr/WS1/laurent3/Codes/LEGATE/legate.core -G Ninja [...]

So it's still trying to use the system's cmake3.

Could it be because pip --global-option is deprecated? (https://github.com/pypa/pip/issues/11859)

CharlelieLrt commented 10 months ago

As a temporary workaround, I created a cmake3 symlink pointing to the right cmake. I am now running into a CUDA compilation error:

      Finished release [optimized] target(s) in 2m 44s
  [98/261] /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/cuda/stream_pool.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/cuda/stream_pool.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/cuda/stream_pool.cu.o
  /usr/include/sys/platform/ppc.h(31): error: identifier "__builtin_ppc_get_timebase" is undefined

I am loading CUDA 12.0.0 with module load cuda/12.0.0, and my conda environment was generated with --ctk 12.0.
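The cmake3 symlink workaround mentioned at the start of this comment could look roughly like the sketch below. The comment does not say where the symlink was placed, so this assumes a private shim directory that is put first on PATH. Both helper names and the shim directory are hypothetical.

```python
import os


def make_cmake3_shim(target_cmake, shim_dir):
    """Create shim_dir/cmake3 as a symlink to target_cmake and return its path."""
    os.makedirs(shim_dir, exist_ok=True)
    link = os.path.join(shim_dir, "cmake3")
    if os.path.islink(link):
        os.remove(link)  # replace a stale link from an earlier run
    os.symlink(target_cmake, link)
    return link


def prepend_to_path(directory, env=None):
    """Return a copy of the environment with directory first on PATH."""
    new_env = dict(os.environ if env is None else env)
    new_env["PATH"] = directory + os.pathsep + new_env.get("PATH", "")
    return new_env


# Example use (paths are placeholders, not taken from the thread):
#   link = make_cmake3_shim("/path/to/conda/env/bin/cmake", "/path/to/shims")
#   env = prepend_to_path("/path/to/shims")
#   subprocess.run(["./install.py", "--cuda"], env=env)
```

A shim directory keeps the change local to one shell or build invocation, which is easier to undo than a link dropped into a shared location.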

manopapad commented 10 months ago

So it's still trying to use the system's cmake3. Could it be because pip --global-option is deprecated? (https://github.com/pypa/pip/issues/11859)

I posted some follow-up comments on #908. This falls beyond my (very limited) knowledge around python packaging.

error: identifier "__builtin_ppc_get_timebase" is undefined

What host compiler are you using? If you try compiling an empty file with /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc --verbose empty.cu, you should be able to see what's getting called. For example, on my local machine I see:

#$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=3 -D__CUDACC_VER_BUILD__=103 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=3 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "a.cu" -o "/tmp/tmpxft_0023d560_00000000-5_a.cpp4.ii"

Are the pure-C++ files compiling correctly? What compiler are they using?
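Reading the host compiler out of `nvcc --verbose` output can be automated. The sketch below assumes that the first `#$ ` line that is not an environment assignment is the host compiler's preprocessing step, as in the output quoted above. The function name is hypothetical.

```python
import shlex


def host_compiler_from_verbose(output):
    """Return the first command nvcc ran, based on `#$ ` lines of --verbose output.

    Assumption: lines such as `#$ PATH=...` are environment assignments and
    the first remaining command is the host compiler invocation.
    """
    for raw in output.splitlines():
        line = raw.strip()
        if not line.startswith("#$ "):
            continue
        try:
            tokens = shlex.split(line[3:])
        except ValueError:
            tokens = line[3:].split()  # fall back on unbalanced quotes
        if not tokens or "=" in tokens[0]:
            continue  # skip assignments like _HERE_=... or PATH=...
        return tokens[0]
    return None


if __name__ == "__main__":
    sample = (
        "#$ _NVVM_BRANCH_=nvvm\n"
        '#$ gcc -D__CUDA_ARCH_LIST__=520 -E -x c++ "a.cu" -o "a.ii"\n'
    )
    print(host_compiler_from_verbose(sample))
```

If the result is a bare name such as gcc, it is resolved through PATH at build time, so `shutil.which` on that name shows which installation nvcc would actually pick up. That may differ from the compiler CMake selected for the pure C++ files.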

CharlelieLrt commented 10 months ago

Trying to compile empty.cu, I get:

#$ gcc -D__NV_NO_HOST_COMPILER_CHECK=1 -std=c++14 -D__CUDA_ARCH_LIST__=520 -E -x c++ -D__CUDACC__ -D__NVCC__  "-I/usr/tce/packages/cuda/cuda-12.0.0/nvidia/bin/../targets/ppc64le-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=0 -D__CUDACC_VER_BUILD__=76 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=0 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" "empty.cu" -o "/var/tmp/laurent3/tmpxft_00003738_00000000-5_empty.cpp4.ii"

The pure C++ files seem to be compiled correctly. They use /usr/tce/packages/gcc/gcc-8.3.1/bin/c++ (c++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)).

manopapad commented 10 months ago

I reached out to compiler experts inside Nvidia and on the Legion Zulip for guidance. Unfortunately I don't have easy access to ppc64le machines to try and personally reproduce.

CharlelieLrt commented 10 months ago

So, given the comments on the Legion Zulip, I switched to a newer commit of Legion (d7121f886127e41773a283cbbaa51c452cd01054) that includes the fix for the __builtin_ppc_get_timebase error.

I now have a number of failed compilations, such as:

  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o
  /usr/tce/packages/gcc/gcc-8.3.1/bin/c++ -DLEGATE_USE_COLLECTIVE -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=gnu++17 -fPIC -mcpu=native -maltivec -mabi=altivec -mvsx -UTHRUST_DEVICE_SYSTEM -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o.d -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/task/variant_options.cc.o -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc: In member function 'void legate::VariantOptions::populate_registrar(Legion::TaskVariantRegistrar&)':
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/task/variant_options.cc:56:13: error: 'struct Legion::TaskVariantRegistrar' has no member named 'set_concurrent'; did you mean 'add_constraint'?
     registrar.set_concurrent(concurrent);
               ^~~~~~~~~~~~~~
               add_constraint

Or:

  FAILED: legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o
  /usr/tce/packages/cuda/cuda-12.0.0/bin/nvcc -forward-unknown-to-host-compiler -DLEGATE_USE_CUDA -DLEGATE_USE_NETWORK -DLEGATE_USE_OPENMP -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_core_EXPORTS -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime/mappers -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/runtime -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src -I/usr/WS1/laurent3/Codes/LEGATE/legate.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/thrust-src/dependencies/cub -isystem /g/g92/laurent3/miniforge3/envs/legate_base/include -isystem /usr/tce/packages/cuda/cuda-12.0.0/nvidia/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/include -isystem /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/include -O2 -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -Xcompiler -pthread -MD -MT legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o -MF legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o.d -x cu -c /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/comm/comm_nccl.cu -o legate-core-cpp/CMakeFiles/legate_core.dir/src/core/comm/comm_nccl.cu.o
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/data/store.h(174): error: namespace "Legion" has no member "OutputRegion"

  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/data/store.h(205): error: namespace "Legion" has no member "OutputRegion"

  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/utilities/deserializer.h(107): error: namespace "Legion" has no member "OutputRegion"

  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/ptr_traits.h(114): error: static assertion failed with "pointer type defines element_type or is like SomePointer<T, Args>"
            detected during:
              instantiation of class "std::pointer_traits<_Ptr> [with _Ptr=<error-type> *]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/alloc_traits.h(102): here
              instantiation of class "std::allocator_traits<_Alloc>::_Ptr<_Func, _Tp, <unnamed>> [with _Alloc=std::allocator<<error-type>>, _Func=std::__allocator_traits_base::__c_pointer, _Tp=const <error-type>, <unnamed>=void]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/alloc_traits.h(135): here
              instantiation of class "std::allocator_traits<_Alloc> [with _Alloc=std::allocator<<error-type>>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/ext/alloc_traits.h(52): here
              instantiation of class "__gnu_cxx::__alloc_traits<_Alloc, <unnamed>> [with _Alloc=std::allocator<<error-type>>, <unnamed>=<error-type>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h(84): here
              instantiation of class "std::_Vector_base<_Tp, _Alloc> [with _Tp=<error-type>, _Alloc=std::allocator<<error-type>>]"
  /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h(339): here
              instantiation of class "std::vector<_Tp, _Alloc> [with _Tp=<error-type>, _Alloc=std::allocator<<error-type>>]"
  /usr/WS1/laurent3/Codes/LEGATE/legate.core/src/core/utilities/deserializer.h(107): here

(and many others)
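The errors above suggest that the Legion headers being compiled against lack names that legate.core expects (`set_concurrent`, `OutputRegion`). A quick way to test a given Legion checkout before rebuilding is a plain-text scan of its headers. The following is a rough sketch under that assumption: a substring match is crude and can give false positives, and the function name is hypothetical.

```python
import os

# Identifiers taken from the compiler errors quoted in this comment.
REQUIRED = ("set_concurrent", "OutputRegion")


def missing_identifiers(runtime_dir, names=REQUIRED, suffixes=(".h", ".inl")):
    """Return the sorted names that appear in no header file under runtime_dir."""
    remaining = set(names)
    for root, _dirs, files in os.walk(runtime_dir):
        for fname in files:
            if not fname.endswith(suffixes):
                continue
            path = os.path.join(root, fname)
            with open(path, errors="replace") as handle:
                text = handle.read()
            remaining -= {name for name in remaining if name in text}
            if not remaining:
                return []
    return sorted(remaining)


# Example use, with the fetched Legion source path seen in the build log:
#   missing_identifiers("_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-src/runtime")
```

An empty result means every listed name occurs somewhere in the headers. A non-empty result points to a Legion commit or branch that does not match what this legate.core version was written against.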

manopapad commented 10 months ago

Can you try with top-of-tree control_replication branch?

CharlelieLrt commented 10 months ago

Legion commit 04ee5be1dc3b742f195348c78458450f5dd35f44 worked, and I had no further problems compiling cunumeric, so everything is good (except for the few things already mentioned above).

Thanks for your help with this!