rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

Add NVTX support and RMM_FUNC_RANGE() macro #1558

Closed · harrism closed this 1 month ago

harrism commented 2 months ago

Description

Let's get RMM allocations and deallocations showing up in profiler timelines.

Closes #495
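
For reference, a minimal sketch of what an `RMM_FUNC_RANGE()` macro can look like on top of the NVTX3 C++ API (the domain name and exact macro body here are illustrative assumptions, not necessarily what this PR merges):

    #include <nvtx3/nvtx3.hpp>

    namespace rmm {
    // Tag type naming the NVTX domain that groups all RMM ranges on the
    // profiler timeline. (Domain name chosen here for illustration.)
    struct librmm_domain {
      static constexpr char const* name{"librmm"};
    };
    }  // namespace rmm

    // Opens an NVTX range named after the enclosing function; the range
    // closes automatically when the enclosing scope exits.
    #define RMM_FUNC_RANGE() NVTX3_FUNC_RANGE_IN(rmm::librmm_domain)

A memory resource's `do_allocate`/`do_deallocate` would then start with `RMM_FUNC_RANGE();`, making each call visible as a range in Nsight Systems.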


bdice commented 2 months ago

The build failure shows:

    /usr/bin/sccache /home/coder/.conda/envs/rapids/bin/x86_64-conda-linux-gnu-c++ -DFMT_HEADER_ONLY=1 -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DSPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_ -DSPDLOG_FMT_EXTERNAL -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -D_torch_allocator_EXPORTS -I/home/coder/rmm/build/conda/cuda-12.2/release/_deps/cccl-src/thrust/thrust/cmake/../.. -I/home/coder/rmm/build/conda/cuda-12.2/release/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/../../../include -I/home/coder/rmm/build/conda/cuda-12.2/release/_deps/cccl-src/cub/cub/cmake/../.. -isystem /home/coder/rmm/include -isystem /home/coder/rmm/build/conda/cuda-12.2/release/_deps/fmt-src/include -isystem /home/coder/rmm/build/conda/cuda-12.2/release/_deps/spdlog-src/include -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/coder/.conda/envs/rapids/include  -I/home/coder/.conda/envs/rapids/targets/x86_64-linux/include  -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib -L/home/coder/.conda/envs/rapids/targets/x86_64-linux/lib/stubs -O3 -DNDEBUG -fPIC -MD -MT rmm/_lib/CMakeFiles/_torch_allocator.dir/_torch_allocator.cpp.o -MF rmm/_lib/CMakeFiles/_torch_allocator.dir/_torch_allocator.cpp.o.d -o rmm/_lib/CMakeFiles/_torch_allocator.dir/_torch_allocator.cpp.o -c /home/coder/rmm/python/rmm/rmm/_lib/_torch_allocator.cpp
    In file included from /home/coder/rmm/include/rmm/mr/device/device_memory_resource.hpp:20,
                     from /home/coder/rmm/include/rmm/mr/device/cuda_memory_resource.hpp:20,
                     from /home/coder/rmm/include/rmm/mr/device/per_device_resource.hpp:21,
                     from /home/coder/rmm/python/rmm/rmm/_lib/_torch_allocator.cpp:19:
    /home/coder/rmm/include/rmm/detail/nvtx/ranges.hpp:19:10: fatal error: nvtx3/nvtx3.hpp: No such file or directory
       19 | #include <nvtx3/nvtx3.hpp>
          |          ^~~~~~~~~~~~~~~~~
    compilation terminated.

This is because we link NVTX with the BUILD_LOCAL_INTERFACE generator expression, so the dependency doesn't propagate to the rmm Python package when it consumes the header-only librmm target:

    target_link_libraries(rmm INTERFACE $<BUILD_LOCAL_INTERFACE:nvtx3-cpp>)

I know we had to use BUILD_LOCAL_INTERFACE for cuDF in https://github.com/rapidsai/cudf/pull/15271 because not doing so led to breakage in static builds. I don't know how this should be handled for RMM as a header-only library. Maybe we just need NVTX to have normal INTERFACE linkage, since it's required by public RMM headers?

edit: I checked with @KyleFromNVIDIA and attempted this change in 702915b.
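
Roughly, that change amounts to dropping the generator expression so the dependency propagates to all consumers, not just targets in the same build tree (a sketch of the suggestion above; making the nvtx3-cpp target resolvable by installed consumers is a separate concern, handled below):

    target_link_libraries(rmm INTERFACE nvtx3-cpp)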

harrism commented 1 month ago

Depends on rapidsai/rapids-cmake#606
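
(That rapids-cmake PR presumably makes nvtx3 available through rapids-cmake's CPM helpers; RMM would then fetch and export it with something like the following sketch, assuming the usual rapids_cpm_* helper pattern:)

    include(${rapids-cmake-dir}/cpm/nvtx3.cmake)
    # Fetch nvtx3 via CPM and record it in RMM's build/install export sets
    # so downstream consumers of the rmm target can resolve nvtx3-cpp.
    rapids_cpm_nvtx3(BUILD_EXPORT_SET rmm-exports INSTALL_EXPORT_SET rmm-exports)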

harrism commented 1 month ago

/merge