openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

Mixed usage with TL/UCP and TL/MLX5: create tl_mlx5 ctx failed #1009

Closed yanminglai closed 1 month ago

yanminglai commented 1 month ago

I am trying to mix TL/UCP and TL/MLX5: use tl mlx5 for all2all and tl ucp for all other collective operations.

How I configure UCC:

    "${UCC_SRC_DIR}/configure" --with-ucx="${UCX_HOME}" \
        --prefix="${UCC_INSTALL_DIR}" --with-mpi \
        --with-ibverbs \
        --with-rdmacm \
        --with-tls=self,shm,ucp,mlx5

Run command:

    mpirun -x UCC_CLS=basic -x UCC_CL_BASIC_TLS=ucp,mlx5 -x UCC_TL_UCP_TUNE=alltoall:0 -x UCC_TL_MLX5_NET_DEVICES=mlx5_2:1 -np 4 ./ucc_test_mpi -c alltoall -o min

image

I then also tested with tl mlx5 only:

    mpirun -x UCC_TLS=mlx5 -x UCC_TL_MLX5_NET_DEVICES=mlx5_2:1 -np 2 ./ucc_test_mpi -c alltoall

This also hit the ctx creation problem: image

Here are my IB devices and bandwidth test: image

Two Questions:

  1. Is this the right way to mix TL usage? (My understanding is that setting UCC_TL_UCP_TUNE=alltoall:0 makes all2all always use tl mlx5.)
  2. How can I locate the cause of the tl mlx5 ctx creation problem?

samnordmann commented 1 month ago

Hi @yanminglai, thanks for this report.

  1. First, tl/mlx5/a2a is temporarily disabled in the repo; PR #996, which is about to be merged, re-enables it.
  2. In your mlx5-only command, please try removing UCC_TLS=mlx5. It may seem counterintuitive, but TL/MLX5 uses TL/UCP for service collectives, so TL/UCP must remain available.
  3. I was able to run tl/mlx5 successfully on upstream/master + #996 with this command line:
    mpirun -x UCC_COLL_TRACE=info -x UCC_TL_MLX5_NET_DEVICES=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_1:1 -x UCC_TL_MLX5_TUNE=inf --mca coll_ucc_enable 0 --map-by ppr:2:node -np 4 test/mpi/ucc_test_mpi -c alltoall -t world -d uint8 -O 0 -v -m 1:128
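
Point 2 above can be turned into a small pre-flight check before launching mpirun. The following is only a sketch: `tls_verdict` is a hypothetical helper, not part of UCC, and it assumes UCC_TLS is a plain comma-separated list of TL names (it does not handle any other syntax the variable may accept). It also assumes that listing ucp next to mlx5 is enough to keep TL/UCP available, which the thread does not confirm explicitly.

```shell
#!/bin/sh
# Hypothetical helper (not part of UCC): classify a UCC_TLS value.
# Assumes a plain comma-separated list of TL names.
# Prints "missing-ucp" when the list selects mlx5 but not ucp,
# and "ok" otherwise (including when the list is empty or unset).
tls_verdict() {
    tls="$1"
    case ",$tls," in
        *,mlx5,*)
            # mlx5 is selected: TL/UCP must also be present,
            # since TL/MLX5 relies on it for service collectives.
            case ",$tls," in
                *,ucp,*) echo "ok" ;;
                *)       echo "missing-ucp" ;;
            esac
            ;;
        *)
            echo "ok"
            ;;
    esac
}

# Check whatever is currently exported in the environment.
tls_verdict "${UCC_TLS:-}"
```

With the mlx5-only command from the original report (UCC_TLS=mlx5) this prints `missing-ucp`; with UCC_TLS unset, as in the working command, it prints `ok`.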

Other remarks:

I hope this is useful. Let me know if you run into further issues.

yanminglai commented 1 month ago

Thank you very much, it answers all my questions. Gonna go ahead and close the issue.