rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0

Issues building C++ multi-GPU test #4596

Closed sg0 closed 3 months ago

sg0 commented 3 months ago

What is your question?

I have been trying to build a C++-based multi-GPU example from cugraph on a Linux cluster:

https://github.com/rapidsai/cugraph/blob/branch-24.10/cpp/examples/users/multi_gpu_application/mg_graph_algorithms.cpp

But I am encountering several “not declared in scope” issues, such as the following, which suggests that I am probably not passing some path or missing a dependency:

  mg_graph_algorithms.cpp:192:33: error: invalid conversion from ‘rmm::mr::device_memory_resource*’ to ‘int’ [-fpermissive]
    192 |   rmm::device_uvector<vertex_t> d_predecessors(graph_view.local_vertex_partition_range_size(),
        |                                 ^~~~~~~~~~~~~~
        |                                 |
        |                                 rmm::mr::device_memory_resource*
  mg_graph_algorithms.cpp:195:33: error: invalid conversion from ‘rmm::mr::device_memory_resource*’ to ‘int’ [-fpermissive]
    195 |   rmm::device_uvector<vertex_t> d_sources(1, handle.get_stream());
        |                                 ^~~~~~~~~
        |                                 |
        |                                 rmm::mr::device_memory_resource*
  mg_graph_algorithms.cpp:222:33: error: invalid conversion from ‘rmm::mr::device_memory_resource*’ to ‘int’ [-fpermissive]
    222 |   rmm::device_uvector<vertex_t> d_cluster_assignments(
        |                                 ^~~~~~~~~~~~~~~~~~~~~
        |                                 |
        |                                 rmm::mr::device_memory_resource*

Since building cugraph from source is time-consuming (I gave up after ~6 hours), I decided to pull the packages via conda instead (yml file attached). Here are the modules loaded on my platform:

  Currently Loaded Modulefiles:
    1) gcc/12.2.0               2) openmpi/4.1.4            3) cuda/12.1                4) cmake/3.28.1             5) python/miniconda24.4.0

I am trying to build using:

  mpic++ -DSPDLOG_FMT_EXTERNAL -I/share/apps/cuda/12.1/include -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include -std=c++17 -o mg_test mg_graph_algorithms.cpp -L/share/apps/cuda/12.1/lib -L/people/ghos167/.conda/envs/cugraph-ldgpu2/lib -lcuda -lcudart -lcugraph

cugraph-ldgpu2.yml.txt

ChuckHastings commented 3 months ago

I'm sorry you're having difficulty getting things to compile. These examples were constructed assuming that you would build cugraph from source, so we haven't tested doing what you are attempting to do. I have done some work in this regard.

I can't reproduce your environment entirely. However, I was able to put together the following suggestion.

Try adding the following to your compile line. If it doesn't work, let me know whether you are seeing the same errors or new ones (and what they are).

 -DFMT_HEADER_ONLY=1 -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DSPDLOG_FMT_EXTERNAL -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR 

You're already setting SPDLOG_FMT_EXTERNAL, so you don't need to specify it twice.

sg0 commented 3 months ago

Thanks, I get a long list of errors, mostly redefinition errors, owing to CCCL (cuda-cccl):

/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/../__concepts/../__concepts/convertible_to.h:60:1: note: in expansion of macro ‘_LIBCUDACXX_CONCEPT_FRAGMENT’
   60 | _LIBCUDACXX_CONCEPT_FRAGMENT(
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/../__concepts/../__concepts/convertible_to.h:63:14: error: there are no arguments to ‘_LIBCUDACXX_TRAIT’ that depend on a template parameter, so a declaration of ‘_LIBCUDACXX_TRAIT’ must be available [-fpermissive]
   63 |     requires(_LIBCUDACXX_TRAIT(is_convertible, _From, _To)),
      |              ^~~~~~~~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/__concept_macros.h:225:23: note: in definition of macro ‘_LIBCUDACXX_CONCEPT_FRAGMENT_REQS_REQUIRES_requires’
  225 |   _Concept::_Requires<__VA_ARGS__>
      |                       ^~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/__concept_macros.h:41:39: note: in expansion of macro ‘_LIBCUDACXX_PP_CAT4_’
   41 | #define _LIBCUDACXX_PP_CAT4(_Xp, ...) _LIBCUDACXX_PP_CAT4_(_Xp, __VA_ARGS__)
      |                                       ^~~~~~~~~~~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/__concept_macros.h:153:3: note: in expansion of macro ‘_LIBCUDACXX_PP_CAT4’
  153 |   _LIBCUDACXX_PP_CAT4(_LIBCUDACXX_CONCEPT_FRAGMENT_REQS_REQUIRES_, _REQ)
      |   ^~~~~~~~~~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/__concept_macros.h:147:3: note: in expansion of macro ‘_LIBCUDACXX_CONCEPT_FRAGMENT_REQS_REQUIRES_OR_NOEXCEPT’
  147 |   _LIBCUDACXX_CONCEPT_FRAGMENT_REQS_REQUIRES_OR_NOEXCEPT
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/cuda/std/detail/libcxx/include/__concepts/__concept_macros.h:34:40: note: in expansion of macro ‘_LIBCUDACXX_CONCEPT_FRAGMENT_REQS_M0’
   34 | #define _LIBCUDACXX_PP_CAT2_(_Xp, ...) _Xp##__VA_ARGS__

So, I uninstalled CCCL, and then retried:

(cugraph-ldgpu2) [ghos167@deception04 mg-graph]$ mpic++ -DSPDLOG_FMT_EXTERNAL -DFMT_HEADER_ONLY=1 -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR -I/share/apps/cuda/12.1/include -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include -std=c++17 -o mg_test mg_graph_algorithms.cpp -L/share/apps/cuda/12.1/lib -L/people/ghos167/.conda/envs/cugraph-ldgpu2/lib -lcuda -lcudart -lcugraph
In file included from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/device_uvector.hpp:19,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/cugraph/dendrogram.hpp:18,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/cugraph/algorithms.hpp:19,
                 from mg_graph_algorithms.cpp:17:
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/cuda_stream_view.hpp:21:10: fatal error: cuda/stream_ref: No such file or directory
   21 | #include <cuda/stream_ref>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.

The error above is the reason I had installed CCCL in the first place.

ChuckHastings commented 3 months ago

Thanks for that update. I think I have a fix for that also, but I'm away from my computer for the night. I'll post something in the morning.

ChuckHastings commented 3 months ago

The problem you are seeing, I believe, is due to the fact that some of the header files are present in multiple directories. One drawback of the #pragma once approach that most C++ developers have moved to for avoiding duplicate headers is that if the same header file appears in different directories it can actually be included twice - resulting in the duplicate symbols you are seeing.
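
One way to check whether an environment is affected is to list the header paths that exist under more than one include root. The sketch below is only an illustration: find_duplicate_headers is a hypothetical helper (not part of RAPIDS or the CUDA toolkit), and it compares relative paths only, not file contents.

```sh
#!/bin/sh
# Hypothetical helper: print every relative path that exists under more than
# one of the include directories given as arguments.
find_duplicate_headers() {
  for dir in "$@"; do
    # List files (and symlinks) relative to this include root.
    (cd "$dir" 2>/dev/null && find . \( -type f -o -type l \) | sed 's|^\./||')
  done | sort | uniq -d   # uniq -d keeps only paths seen more than once
}
```

Calling it as find_duplicate_headers "$CONDA_PREFIX/include/rapids/libcudacxx" "$CONDA_PREFIX/include" prints any relative path that both roots provide. An empty result means the two roots do not overlap by name.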

You'll need to experiment a bit, since I can't exactly replicate your environment. Here's a combination of include options that worked for me.

-I/raid/charlesh/mambaforge/envs/test_issue_4596/include/rapids -I/raid/charlesh/mambaforge/envs/test_issue_4596/include/rapids/libcudacxx -isystem /raid/charlesh/mambaforge/envs/test_issue_4596/include -isystem /raid/charlesh/mambaforge/envs/test_issue_4596/targets/x86_64-linux/include 

This link provides some explanation of the motivation for -I versus -isystem. The short version is that -I directories are searched first, then -isystem directories, then the standard system include directories. The objective is to separate things so that the duplicate header files sit at different levels (only one copy in the -I directories, any others in a -isystem directory or one of the system directories).

Obviously change the directory path to point to your conda environment. A RAPIDS installation should have the first 2 elements as part of your conda environment. That should give you the proper versions of thrust, cub, CCCL. The third item in the list will get you cugraph and any other conda packages installed. The last item I think is required to pick up some of the headers specific to x86_64 architectures (some of the implementation details).
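
The ordering described above can be sketched as a small shell function that derives the four options from a conda environment root. This is a minimal sketch, not a tested recipe: rapids_include_flags is a hypothetical name, the directory layout is simply copied from the options above, and the optional second argument (a separately installed CUDA toolkit, appended last as -isystem) is an assumption about how a site-wide CUDA install could be kept from shadowing the conda headers.

```sh
#!/bin/sh
# Hypothetical helper: print include options for a RAPIDS conda environment.
#   $1 = conda environment root
#   $2 = optional root of a separately installed CUDA toolkit (assumption)
rapids_include_flags() {
  prefix=$1
  cuda_home=${2:-}

  # -I: RAPIDS-bundled thrust/cub/libcudacxx headers, searched first.
  flags="-I$prefix/include/rapids -I$prefix/include/rapids/libcudacxx"
  # -isystem: the rest of the environment, searched after every -I directory.
  flags="$flags -isystem $prefix/include"
  flags="$flags -isystem $prefix/targets/x86_64-linux/include"
  # A site-wide CUDA toolkit, if given, goes last.
  if [ -n "$cuda_home" ]; then
    flags="$flags -isystem $cuda_home/include"
  fi
  printf '%s\n' "$flags"
}
```

The output is meant to be spliced into the compile line unquoted, for example mpic++ $(rapids_include_flags "$CONDA_PREFIX") -std=c++17 ..., which relies on word splitting and therefore breaks if any path contains spaces.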

sg0 commented 3 months ago

Thanks, I encountered a long list of errors with -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE in the updated includes after your suggestion, so I removed it:

mpic++ -DSPDLOG_FMT_EXTERNAL -DFMT_HEADER_ONLY=1 -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR -I/share/apps/cuda/12.1/include -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rapids -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rapids/libcudacxx -isystem /people/ghos167/.conda/envs/cugraph-ldgpu2/include -isystem /people/ghos167/.conda/envs/cugraph-ldgpu2/targets/x86_64-linux/include -std=c++17 -o mg_test mg_graph_algorithms.cpp -L/share/apps/cuda/12.1/lib -L/people/ghos167/.conda/envs/cugraph-ldgpu2/lib -lcuda -lcudart -lcugrap

But then, there are these CUDA namespace related issues in RMM:

In file included from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/device_uvector.hpp:19,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/cugraph/dendrogram.hpp:18,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/cugraph/algorithms.hpp:19,
                 from mg_graph_algorithms.cpp:17:
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/cuda_stream_view.hpp:67:30: error: ‘cuda’ has not been declared
   67 |   constexpr cuda_stream_view(cuda::stream_ref stream) noexcept : stream_{stream.get()} {}
      |                              ^~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/cuda_stream_view.hpp:67:46: error: expected ‘)’ before ‘stream’
   67 |   constexpr cuda_stream_view(cuda::stream_ref stream) noexcept : stream_{stream.get()} {}
      |                             ~                ^~~~~~~
      |                                              )
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/cuda_stream_view.hpp:67:88: error: expected unqualified-id before ‘{’ token
   67 |   constexpr cuda_stream_view(cuda::stream_ref stream) noexcept : stream_{stream.get()} {}
      |                                                                                        ^
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/cuda_stream_view.hpp:88:22: error: ‘cuda’ does not name a type; did you mean ‘cudaPos’?
   88 |   constexpr operator cuda::stream_ref() const noexcept { return value(); }
      |                      ^~~~
      |                      cudaPos
In file included from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/cuda_memory_resource.hpp:20,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/per_device_resource.hpp:21,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/device_buffer.hpp:21,
                 from /people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/device_uvector.hpp:22:
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/device_memory_resource.hpp:310:59: error: ‘cuda’ has not been declared
  310 |   friend void get_property(device_memory_resource const&, cuda::mr::device_accessible) noexcept {}
      |                                                           ^~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/device_memory_resource.hpp:359:15: error: ‘cuda’ has not been declared
  359 | static_assert(cuda::mr::async_resource_with<device_memory_resource, cuda::mr::device_accessible>);
      |               ^~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/device_memory_resource.hpp:359:67: error: expected primary-expression before ‘,’ token
  359 | static_assert(cuda::mr::async_resource_with<device_memory_resource, cuda::mr::device_accessible>);
      |                                                                   ^
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/device_memory_resource.hpp:359:69: error: expected string-literal before ‘cuda’
  359 | static_assert(cuda::mr::async_resource_with<device_memory_resource, cuda::mr::device_accessible>);
      |                                                                     ^~~~
/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rmm/mr/device/device_memory_resource.hpp:359:68: error: expected ‘)’ before ‘cuda’
  359 | static_assert(cuda::mr::async_resource_with<device_memory_resource, cuda::mr::device_accessible>);
      |              ~                                                     ^~~~~
      |                                                                    )

I also tried building RMM separately, but then I see the same errors as before:

/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:171:79: error: invalid conversion from ‘rmm::mr::device_memory_resource*’ to ‘int’ [-fpermissive]
  171 |                 device_async_resource_ref mr = mr::get_current_device_resource())
      |                                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                                               |
      |                                                                               rmm::mr::device_memory_resource*
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp: In constructor ‘rmm::device_buffer::device_buffer()’:
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:97:21: error: class ‘rmm::device_buffer’ does not have any field named ‘_mr’
   97 |   device_buffer() : _mr{rmm::mr::get_current_device_resource()} {}
      |                     ^~~
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp: In constructor ‘rmm::device_buffer::device_buffer(std::size_t, rmm::cuda_stream_view, int)’:
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:112:24: error: class ‘rmm::device_buffer’ does not have any field named ‘_mr’
  112 |     : _stream{stream}, _mr{mr}
      |                        ^~~
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp: In constructor ‘rmm::device_buffer::device_buffer(const void*, std::size_t, rmm::cuda_stream_view, int)’:
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:141:24: error: class ‘rmm::device_buffer’ does not have any field named ‘_mr’
  141 |     : _stream{stream}, _mr{mr}
      |                        ^~~
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp: In constructor ‘rmm::device_buffer::device_buffer(rmm::device_buffer&&)’:
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:192:7: error: class ‘rmm::device_buffer’ does not have any field named ‘_mr’
  192 |       _mr{other._mr},
      |       ^~~
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:192:17: error: ‘class rmm::device_buffer’ has no member named ‘_mr’
  192 |       _mr{other._mr},
      |                 ^~~
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp: In member function ‘rmm::device_buffer& rmm::device_buffer::operator=(rmm::device_buffer&&)’:
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:226:7: error: ‘_mr’ was not declared in this scope; did you mean ‘mr’?
  226 |       _mr     = other._mr;
      |       ^~~
      |       mr
/people/ghos167/builds/rmm-cuda-12.1/include/rmm/device_buffer.hpp:226:23: error: ‘class rmm::device_buffer’ has no member named ‘_mr’
  226 |       _mr     = other._mr;

ChuckHastings commented 3 months ago

Try moving -I/share/apps/cuda/12.1/include to be -isystem /share/apps/cuda/12.1/include, and maybe put it last in the order.

The error looks like you're picking up a different version of a cuda file.
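
A quick way to test that suspicion is to walk the include directories in the order the compiler would search them and see which copy of a header comes first. The helper below is a hypothetical sketch (first_header_match is not an existing tool) and only approximates the real lookup. GCC's -H option, which prints each header as it is included, is the authoritative check.

```sh
#!/bin/sh
# Hypothetical helper: given a header name followed by include directories in
# search order (-I directories first, then -isystem ones), print the first
# existing copy. Returns non-zero if no directory provides the header.
first_header_match() {
  header=$1
  shift
  for dir in "$@"; do
    if [ -e "$dir/$header" ]; then
      printf '%s\n' "$dir/$header"
      return 0
    fi
  done
  return 1
}
```

For example, first_header_match cuda/stream_ref /share/apps/cuda/12.1/include "$CONDA_PREFIX/include" mimics a command line where the site CUDA directory is passed with -I ahead of the conda environment. If the printed path is not the conda copy, the site headers are shadowing it.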

Never mind. This error occurs when you don't define -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE, so you need that flag enabled.

What errors are you seeing when you enable that flag?

sg0 commented 3 months ago

Upon close inspection, I noticed that the errors were "multiple redefinition errors", so I moved the CUDA runtime headers later as you had suggested. Then it worked:

mpic++ -DSPDLOG_FMT_EXTERNAL -DFMT_HEADER_ONLY=1 -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rapids -I/people/ghos167/.conda/envs/cugraph-ldgpu2/include/rapids/libcudacxx -isystem /people/ghos167/.conda/envs/cugraph-ldgpu2/include -isystem /people/ghos167/.conda/envs/cugraph-ldgpu2/targets/x86_64-linux/include -isystem /share/apps/cuda/12.1/include -std=c++17 -o mg_test mg_graph_algorithms.cpp -L/share/apps/cuda/12.1/lib -L/people/ghos167/.conda/envs/cugraph-ldgpu2/lib -ldl -lcudart -lcugraph -lnccl

Thanks for the suggestions, I am closing the issue.

ChuckHastings commented 3 months ago

Great. Glad you were able to resolve this.