Open charlesbluca opened 1 month ago
As far as I remember, you need to install `nvcc_linux-64=11.8`; could you check if that works?
Installing that, it seems like the resulting `nvcc` bin is just wrapping what I assume is a system installation of `nvcc`?
→ conda list nvcc
# packages in environment at /home/charlesb/miniforge3/envs/ucxx-cuda-118:
#
# Name Version Build Channel
nvcc_linux-64 11.8 h9852d18_24 conda-forge
→ nvcc
/home/charlesb/miniforge3/envs/ucxx-cuda-118/bin/nvcc: line 9: /bin/nvcc: No such file or directory
→ cat /home/charlesb/miniforge3/envs/ucxx-cuda-118/bin/nvcc
#!/bin/bash
for arg in "${@}" ; do
case ${arg} in -ccbin)
# If -ccbin argument is already provided, don't add an additional one.
exec "${CUDA_HOME}/bin/nvcc" "${@}"
esac
done
exec "${CUDA_HOME}/bin/nvcc" -ccbin "${CXX}" "${@}"
IIRC, with CUDA 11.x `CUDA_HOME` is redefined during `conda activate`. Can you check if deactivating and reactivating your environment changes the behavior?
By "redefined" I mean it should be redefined to `$CONDA_PREFIX`.
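A minimal sketch of why the wrapper above reports `/bin/nvcc: No such file or directory`: when `CUDA_HOME` is unset or empty, `"${CUDA_HOME}/bin/nvcc"` expands to `/bin/nvcc`. The function name `nvcc_target` is hypothetical and only mirrors the expansion the wrapper performs; it does not execute anything.

```shell
#!/bin/sh
# Hypothetical helper: prints the path the wrapper script would try to exec.
# ${CUDA_HOME-} behaves like ${CUDA_HOME} but also works under `set -u`.
nvcc_target() {
    printf '%s\n' "${CUDA_HOME-}/bin/nvcc"
}
```

With `CUDA_HOME` unset this prints `/bin/nvcc`, matching the error message; with `CUDA_HOME=/usr/local/cuda-11.8` it prints `/usr/local/cuda-11.8/bin/nvcc`.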
Ah, thanks for that tip. This highlights what I assume is an underlying issue here: we aren't able to locate `CUDA_HOME` during environment activation:
→ conda activate ucxx-cuda-118
Cannot determine CUDA_HOME: cuda-gdb not in PATH
This warning specifically starts popping up with the installation of `nvcc_linux-64` in the environment.
I had this discussion with @robertmaynard in the past, his answer was:
The conda nvcc script uses cuda-gdb to determine the CUDA install location if CUDA_HOME hasn't been explicitly set beforehand, so if the machine doesn't have cuda-gdb, the conda activation scripts will fail to set up CUDA_HOME, which you will then need to do manually.
So yeah, I think you need a system install of CTK for CUDA 11.x to be able to compile.
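The detection described above can be sketched roughly as follows. This is not the actual `nvcc_linux-64` activation script: the function name `detect_cuda_home` is hypothetical, and the assumption that the toolkit root is the parent of the directory containing `cuda-gdb` is mine.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the real conda activation script.
detect_cuda_home() {
    # Respect a value the user has already set explicitly.
    if [ -n "${CUDA_HOME-}" ]; then
        printf '%s\n' "$CUDA_HOME"
        return 0
    fi
    # Otherwise look for cuda-gdb on PATH.
    gdb_path=$(command -v cuda-gdb) || {
        echo "Cannot determine CUDA_HOME: cuda-gdb not in PATH" >&2
        return 1
    }
    # Assumption: cuda-gdb lives in <toolkit root>/bin, so go up two levels.
    dirname "$(dirname "$gdb_path")"
}
```

On a machine without a system CTK, the second branch fails, which is consistent with the warning shown at `conda activate`. Exporting `CUDA_HOME` before activation would take the first branch instead.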
Thanks @pentschev, installed CTK 12.5 on my system (seemingly the oldest version available for ubuntu24.04 right now), and that unblocked builds.
Moving forward, can or should we explicitly encode a CTK dependency similar to what RMM is doing in its CMakeLists.txt?
https://github.com/rapidsai/rmm/blob/c494395e58288cac16321ce90e9b15f3508ae89a/CMakeLists.txt#L62-L65
Or is this too brittle of a solution, with just general documentation of system installing CTK for 11.x builds making more sense?
Also worth noting that it seems like there's more required than just properly setting `CUDA_HOME` here, as even manually setting it to the `CONDA_PREFIX` above, which I can see contains `libcudart`, seems to raise the same failures.
Can you try setting the env variable `CUDA_PATH`? That is what is used by CMake (not `CUDA_HOME`).
Thanks for the tip - looks like that's still failing. For reference the command I'm working with:
$ CUDA_PATH=/home/charlesb/miniforge3/envs/ucxx-cuda-118 ./build.sh
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /home/charlesb/miniforge3/envs/ucxx-cuda-118/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/charlesb/miniforge3/envs/ucxx-cuda-118/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- CPM: Using local package rmm@24.12.0
-- Configuring done (2.1s)
CMake Error at /home/charlesb/miniforge3/envs/ucxx-cuda-118/lib/cmake/rmm/rmm-targets.cmake:61 (set_target_properties):
The link interface of target "rmm::rmm" contains:
CUDA::cudart
but the target was not found. Possible reasons include:
* There is a typo in the target name.
* A find_package call is missing for an IMPORTED target.
* An ALIAS target is missing.
Call Stack (most recent call first):
/home/charlesb/miniforge3/envs/ucxx-cuda-118/lib/cmake/rmm/rmm-config.cmake:75 (include)
build/cmake/CPM_0.40.0.cmake:249 (find_package)
build/cmake/CPM_0.40.0.cmake:303 (cpm_find_package)
build/_deps/rapids-cmake-src/rapids-cmake/cpm/find.cmake:189 (CPMFindPackage)
build/_deps/rapids-cmake-src/rapids-cmake/cpm/rmm.cmake:75 (rapids_cpm_find)
cmake/thirdparty/get_rmm.cmake:20 (rapids_cpm_rmm)
cmake/thirdparty/get_rmm.cmake:24 (find_and_configure_rmm)
CMakeLists.txt:112 (include)
-- Generating done (0.0s)
CMake Generate step failed. Build files cannot be regenerated correctly.
Output of `conda list`:
I would need a full trace log from CMake to see what is exactly going wrong.
IIRC the command line would be:
CUDA_PATH=/home/charlesb/miniforge3/envs/ucxx-cuda-118 ./build.sh --cmake-args=\"--trace\" > log
Here's a log with CMake traces enabled:
Some clarification: the `cuda-gdb` detection logic is what conda uses to find a local install of CUDA 11.x. CMake uses different logic for finding `nvcc` and, from there, extracting the rest of the CUDA Toolkit libraries and headers. @charlesbluca In the trace you provided, FindCUDAToolkit is failing since it can't find `nvcc` or the sentinel version files inside the CUDA Toolkit.
I think the primary issue is that `CUDA_PATH` needs to point not to your conda env, but to the local install of the CUDA Toolkit, e.g. `/usr/local/cuda-11.8/`.
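Before passing a directory as `CUDA_PATH`, it may help to check that it looks like a toolkit root. The sketch below is only a rough pre-flight check, not a reimplementation of FindCUDAToolkit: the function name `looks_like_cuda_toolkit` is hypothetical, and the sentinel file names `version.txt` and `version.json` are my assumption about what the toolkit ships.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a candidate CUDA_PATH value.
# Succeeds if the directory has an executable bin/nvcc or a version
# sentinel file (file names assumed: version.txt, version.json).
looks_like_cuda_toolkit() {
    root=$1
    [ -x "$root/bin/nvcc" ] ||
        [ -f "$root/version.txt" ] ||
        [ -f "$root/version.json" ]
}
```

For example, `looks_like_cuda_toolkit /usr/local/cuda-11.8 && CUDA_PATH=/usr/local/cuda-11.8 ./build.sh` would only start the build when the directory passes the check.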
@charlesbluca do you think there's still anything we should do in UCXX for better UX?
When attempting to build UCXX with the CUDA 11.8 conda environment on a system without `nvcc` pre-installed (i.e. all CTK components being installed through conda), I get the following error at build configuration:

This was somewhat confusing, as the conda install itself raised a warning message that implied I should have `libcudart` in the conda environment:

Saw that these failures were coming up in the configuration of RMM, so tried building that with its accompanying 11.8 conda environment and got a somewhat clearer error that it was unable to find an installation of CTK on my system (`nvcc` bin was missing):

Was unable to reproduce this with the 12.5 environment, which does pull a conda installation of `nvcc`.

About to do a system installation of CTK on the system to see if this unblocks