openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

How to use PCIe peer-to-peer or NVLink between two containers that each have an isolated GPU #10070

Open linxiaochou opened 3 weeks ago

linxiaochou commented 3 weeks ago

I am a new user of UCX. I have a situation where two containers each use a different GPU. On the host, the two GPU devices can communicate via PCIe P2P or NVLink, but inside the containers they cannot.

I am looking for a way to solve this problem.

See the NVLink and Docker/Kubernetes section of the ucx-py readthedocs documentation: "In order to use NVLink when running in containers using Docker and/or Kubernetes the processes must share an IPC namespace for NVLink to work correctly."
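If I understand that note correctly, it amounts to something like the following with Docker's --ipc option (the container names, image name, and command here are just placeholders for illustration):

# first container exposes a shareable IPC namespace
docker run -d --name gpu0 --gpus device=0 --ipc shareable my_image sleep infinity
# second container joins the first container's IPC namespace
docker run -d --name gpu1 --gpus device=1 --ipc container:gpu0 my_image sleep infinity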

Can UCX solve this problem? And how can it be solved, if at all? Your assistance in this matter would be greatly appreciated.

rakhmets commented 3 weeks ago

Please try sharing the PID namespace between the containers. For example, add the following option to the command that runs the first container:

--name docker_1

and add this to the second container's command line:

--pid=container:docker_1

The containers will then share a PID namespace.
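A minimal sketch of the two commands (the image name is a placeholder; add your usual GPU, network, and volume options):

docker run -it --rm --name docker_1 --gpus device=0 <image>
docker run -it --rm --pid container:docker_1 --gpus device=1 <image>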

linxiaochou commented 3 weeks ago

@rakhmets Thank you for your reply and suggestions.

I tried your method:

The first container:

docker run --name master -it --rm --gpus device=0 --network bridge --ipc host -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3

The second container:

docker run -it --rm --gpus device=1 --network bridge --ipc host --pid 'container:master' -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3

The two containers each use a different GPU. The following is the topology shown by nvidia-smi topo -m:

      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    NV12  SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0             N/A
GPU1  NV12   X    SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0             N/A
GPU2  SYS   SYS    X    NV12  SYS   SYS   PIX   PIX   16-31,48-63   1             N/A
GPU3  SYS   SYS   NV12   X    SYS   SYS   PIX   PIX   16-31,48-63   1             N/A
NIC0  PIX   PIX   SYS   SYS    X    PIX   SYS   SYS
NIC1  PIX   PIX   SYS   SYS   PIX    X    SYS   SYS
NIC2  SYS   SYS   PIX   PIX   SYS   SYS    X    PIX
NIC3  SYS   SYS   PIX   PIX   SYS   SYS   PIX    X

I then ran the following command in each container:

The first container:

torchrun --nnodes 2 --nproc_per_node 1 --node_rank 0 --master_addr 172.17.0.2 --master_port 29400 multinode.py

The second container:

torchrun --nnodes 2 --nproc_per_node 1 --node_rank 1 --master_addr 172.17.0.2 --master_port 29400 multinode.py

But as a result, the first container reported an error, and the output is as follows:

[1724241372.462397] [2a292d2c18cc:984 :0]  tl_cuda_cache.c:231  UCC  ERROR ipc-cache: failed to open ipc mem handle. addr:0x7fe456000000 len:16777216 err:1
Traceback (most recent call last):
  File "/data/multinode.py", line 141, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/data/multinode.py", line 128, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/data/multinode.py", line 65, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-08-21 11:56:17,385] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 984) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
. . .
Root Cause (first observed failure):
[0]:
  time       : 2024-08-21_11:56:17
  host       : 2a292d2c18cc
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 984)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

And the second container is stuck with no output.

My understanding is that UCC is a communication library built on top of UCX; please correct me if I am wrong. I then looked at the code location of the UCC error, which uses the CUDA IPC interface.
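Would something like the following be a valid way to check whether cuda_ipc works between the two containers, independently of PyTorch? (I am assuming ucx_perftest is available in the image; the IP is the first container's address, and the transport list is just what I would try.)

# in the first container (server side)
UCX_TLS=tcp,cuda_copy,cuda_ipc ucx_perftest -t tag_bw -m cuda
# in the second container (client side)
UCX_TLS=tcp,cuda_copy,cuda_ipc ucx_perftest 172.17.0.2 -t tag_bw -m cuda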

Does this interface require the two GPUs not to be split across containers? To test this, I mounted both GPUs into each container using the --gpus parameter, so both containers saw the same two GPUs. This time it appeared to work: both containers produced output. However, nvidia-smi showed that GPU0 was used by both containers, while GPU1 was not used at all.
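With --nproc_per_node 1, both processes get LOCAL_RANK=0, so I suspect each one simply picks the first visible device. If that is the cause, pinning each container to a different device should help, for example:

# in the first container
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes 2 --nproc_per_node 1 --node_rank 0 --master_addr 172.17.0.2 --master_port 29400 multinode.py
# in the second container
CUDA_VISIBLE_DEVICES=1 torchrun --nnodes 2 --nproc_per_node 1 --node_rank 1 --master_addr 172.17.0.2 --master_port 29400 multinode.py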

So I would like to ask whether this error was caused by UCC. If so, could you please give an example of how to solve it with UCX? Looking forward to your reply and suggestions.
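P.S. If it would help, I can rerun with more verbose logging; I assume raising the log level, e.g. as below in the first container, would show which transports UCC/UCX actually select:

UCX_LOG_LEVEL=debug UCC_LOG_LEVEL=debug torchrun --nnodes 2 --nproc_per_node 1 --node_rank 0 --master_addr 172.17.0.2 --master_port 29400 multinode.py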