microsoft / mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications
MIT License

[Bug] `mscclpp/concurrency_device.hpp: No such file or directory` #335

Closed · TZHelloWorld closed 1 month ago

TZHelloWorld commented 1 month ago

I installed mscclpp from source using

pip3 install -e .

and then ran the Python benchmark to test it:

mpirun --allow-run-as-root -np 2 python3 ./python/mscclpp_benchmark/allreduce_bench.py

I can find the header with find / -name "concurrency_device.hpp":

/usr/local/lib/python3.8/dist-packages/mscclpp/include/mscclpp/concurrency_device.hpp
/workspace/mscclpp/include/mscclpp/concurrency_device.hpp

but the error still reports mscclpp/concurrency_device.hpp: No such file or directory:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   7234c86e716e
  Local device: mlx5_1
--------------------------------------------------------------------------
[1722937396.514978] [7234c86e716e:74755:0]       ib_iface.c:964  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
[1722937396.515009] [7234c86e716e:74756:0]       ib_iface.c:964  UCX  ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory
[7234c86e716e:74755] pml_ucx.c:309  Error: Failed to create UCP worker
[7234c86e716e:74756] pml_ucx.c:309  Error: Failed to create UCP worker
Selected Interface: eth0, IP Address: 172.17.0.9
Selected Interface: eth0, IP Address: 172.17.0.9
/workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu:10:10: fatal error: mscclpp/concurrency_device.hpp: No such file or directory
   10 | #include <mscclpp/concurrency_device.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
/workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu:10:10: fatal error: mscclpp/concurrency_device.hpp: No such file or directory
   10 | #include <mscclpp/concurrency_device.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Traceback (most recent call last):
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 125, in _compile_cuda
    subprocess.run(command, capture_output=True, text=True, check=True, bufsize=1)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvcc', '-std=c++17', '-ptx', '-Xcompiler', '-Wall,-Wextra', '-I/workspace/mscclpp/python/mscclpp/include', '/workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu', '--gpu-architecture=compute_80', '--gpu-code=sm_80,compute_80', '-o', '/tmp/tmpanvcn9_774755/allreduce2.ptx', '-DTYPE=float']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./python/mscclpp_benchmark/allreduce_bench.py", line 304, in <module>
    size, mscclpp_algBw, nccl_algBw, speed_up = run_benchmark(mscclpp_group, nccl_comm, table, 100, nelems)
  File "./python/mscclpp_benchmark/allreduce_bench.py", line 172, in run_benchmark
    mscclpp_algos = [MscclppAllReduce2(mscclpp_group, memory, memory_out)]
  File "/workspace/mscclpp/python/mscclpp_benchmark/mscclpp_op.py", line 122, in __init__
    self.kernel = KernelBuilder(
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 80, in __init__
    ptx = self._compile_cuda(os.path.join(self._current_file_dir, file), f"{kernel_name}.ptx")
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 130, in _compile_cuda
    raise RuntimeError("Compilation failed: ", " ".join(command))
RuntimeError: ('Compilation failed: ', 'nvcc -std=c++17 -ptx -Xcompiler -Wall,-Wextra -I/workspace/mscclpp/python/mscclpp/include /workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu --gpu-architecture=compute_80 --gpu-code=sm_80,compute_80 -o /tmp/tmpanvcn9_774755/allreduce2.ptx -DTYPE=float')
Traceback (most recent call last):
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 125, in _compile_cuda
    subprocess.run(command, capture_output=True, text=True, check=True, bufsize=1)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvcc', '-std=c++17', '-ptx', '-Xcompiler', '-Wall,-Wextra', '-I/workspace/mscclpp/python/mscclpp/include', '/workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu', '--gpu-architecture=compute_80', '--gpu-code=sm_80,compute_80', '-o', '/tmp/tmpuqy0j6zm74756/allreduce2.ptx', '-DTYPE=float']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./python/mscclpp_benchmark/allreduce_bench.py", line 304, in <module>
    size, mscclpp_algBw, nccl_algBw, speed_up = run_benchmark(mscclpp_group, nccl_comm, table, 100, nelems)
  File "./python/mscclpp_benchmark/allreduce_bench.py", line 172, in run_benchmark
    mscclpp_algos = [MscclppAllReduce2(mscclpp_group, memory, memory_out)]
  File "/workspace/mscclpp/python/mscclpp_benchmark/mscclpp_op.py", line 122, in __init__
    self.kernel = KernelBuilder(
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 80, in __init__
    ptx = self._compile_cuda(os.path.join(self._current_file_dir, file), f"{kernel_name}.ptx")
  File "/workspace/mscclpp/python/mscclpp/utils.py", line 130, in _compile_cuda
    raise RuntimeError("Compilation failed: ", " ".join(command))
RuntimeError: ('Compilation failed: ', 'nvcc -std=c++17 -ptx -Xcompiler -Wall,-Wextra -I/workspace/mscclpp/python/mscclpp/include /workspace/mscclpp/python/mscclpp_benchmark/allreduce.cu --gpu-architecture=compute_80 --gpu-code=sm_80,compute_80 -o /tmp/tmpuqy0j6zm74756/allreduce2.ptx -DTYPE=float')
nanobind: leaked 2 instances!
nanobind: leaked 1 keep_alive records!
nanobind: leaked 4 types!
 - leaked type "mscclpp._mscclpp.Bootstrap"
 - leaked type "mscclpp._mscclpp.SmDevice2DeviceSemaphore"
 - leaked type "mscclpp._mscclpp.DeviceHandle"
 - leaked type "mscclpp._mscclpp.TcpBootstrap"
nanobind: leaked 24 functions!
 - leaked function ""
 - leaked function "__init__"
 - leaked function "all_gather"
 - leaked function ""
 - leaked function "create_unique_id"
 - leaked function "send"
 - leaked function "__init__"
 - leaked function "device_handle"
 - leaked function ""
 - leaked function "get_n_ranks_per_node"
 - leaked function "get_n_ranks"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.
nanobind: leaked 2 instances!
nanobind: leaked 1 keep_alive records!
nanobind: leaked 4 types!
 - leaked type "mscclpp._mscclpp.Bootstrap"
 - leaked type "mscclpp._mscclpp.SmDevice2DeviceSemaphore"
 - leaked type "mscclpp._mscclpp.DeviceHandle"
 - leaked type "mscclpp._mscclpp.TcpBootstrap"
nanobind: leaked 24 functions!
 - leaked function ""
 - leaked function "get_n_ranks"
 - leaked function ""
 - leaked function "__init__"
 - leaked function "barrier"
 - leaked function "__init__"
 - leaked function ""
 - leaked function "create"
 - leaked function "device_handle"
 - leaked function "get_n_ranks_per_node"
 - leaked function "recv"
 - ... skipped remainder
nanobind: this is likely caused by a reference counting issue in the binding code.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[48938,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[7234c86e716e:74751] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[7234c86e716e:74751] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
chhwang commented 1 month ago

Please unset MSCCLPP_HOME and retry.
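
For example (rerunning the same benchmark command from the report above):

unset MSCCLPP_HOME
mpirun --allow-run-as-root -np 2 python3 ./python/mscclpp_benchmark/allreduce_bench.py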

TZHelloWorld commented 1 month ago

My issue was caused by the error `ibv_create_cq(cqe=4096) failed: Cannot allocate memory`.

To explain my situation: I am creating a container from a Docker image. For security reasons, I did not use the `--privileged` mode. Instead, I used the `--device` flag to access the IB devices located in the `/dev/infiniband/` directory on the host machine from within the container. After entering the container, when I ran the `ucx_info -d` command, I encountered the error `UCX ERROR ibv_create_cq(cqe=4096) failed: Cannot allocate memory`. To resolve this issue, it is necessary to add the `--cap-add=IPC_LOCK` option when creating the container, allowing it to access the InfiniBand devices and the host network.
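
For reference, a container-creation command along these lines should work (`<your-image>` is a placeholder; add whatever GPU flags your setup needs, e.g. `--gpus all`):

docker run -it --network=host --cap-add=IPC_LOCK --device=/dev/infiniband <your-image> bash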