Closed NancyAgarwal013 closed 10 months ago
Hi @NancyAgarwal013, can you share the SKU of your IB nodes? Are they H100 SKUs on Azure or some on-premises DGX nodes?
To run the ib-traffic benchmark on H100 nodes, you need the cuda12.1 Docker image and mpi mode, so the logs from the other configurations are not useful.
For the cuda12.1 + mpi mode error, all I can tell from your screenshot is that "192.168.1.107:/root/sb-workspace/outputs/2023-10-23_11-44-05/" does not exist. Could you share all the commands you ran and the related logs? You can append |& tee xxx.log to a command to redirect all stdout/stderr to a file.
Hi @abuccts, the SKU is on-premises HGX H100 nodes.
I just deploy sb as usual and then run my YAML file with the command sb run -f mix.ini -c ibtrafficTest.yaml
// here mix.ini is the Ansible file that contains the IPs of all the GPU nodes // ibtrafficTest.yaml is the YAML file where the benchmark configuration is written
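The log capture requested earlier in the thread can be combined with this command. The following is a minimal bash sketch, not part of SuperBench: `run_logged` and the log file name `sb-run.log` are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SuperBench): run any command and
# save its stdout and stderr to a log file while still printing them.
run_logged() {
  local log_file="$1"
  shift
  # `|&` is bash shorthand for `2>&1 |`, so tee receives both streams.
  "$@" |& tee "$log_file"
}

# Usage with the command from this thread (requires SuperBench to be installed):
# run_logged sb-run.log sb run -f mix.ini -c ibtrafficTest.yaml
```

Note that `|&` needs bash 4 or later; in other shells, `2>&1 | tee xxx.log` has the same effect.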
This issue has been solved offline.
What's the issue, what's expected?:
Hi Team,
I am trying to run the ib-traffic benchmark on my control node. Below is the YAML that I am using.
ib-traffic:
  enable: true
  modes:
    - name: mpi
      proc_num: 8
  parameters:
    msg_size: 8388608
    ibdev: mlx5$LOCAL_RANK
    gpu_dev: $LOCAL_RANK
    numa_dev: $((LOCAL_RANK/2))
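To see which devices each process would be given, the per-rank expressions in this configuration can be expanded by hand. The sketch below assumes the expressions are evaluated by a shell with LOCAL_RANK set to each process's local rank (0 to 7 for proc_num: 8); how SuperBench evaluates them internally is not confirmed in this thread, and `rank_mapping` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the values the config expressions would
# produce for each local rank, assuming plain shell expansion.
rank_mapping() {
  local proc_num="$1"
  local LOCAL_RANK
  for ((LOCAL_RANK = 0; LOCAL_RANK < proc_num; LOCAL_RANK++)); do
    # Same expressions as in the YAML: mlx5$LOCAL_RANK, $LOCAL_RANK, $((LOCAL_RANK/2))
    echo "rank=${LOCAL_RANK} ibdev=mlx5${LOCAL_RANK} gpu_dev=${LOCAL_RANK} numa_dev=$((LOCAL_RANK/2))"
  done
}

rank_mapping 8
```

The printed ibdev names can then be compared with the device names that ibv_devices reports on each GPU node.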
SKU version - Linux(ubuntu 22.04) Standard D2s v3 (2 vcpus, 8 GiB memory)
I am using the following SuperBench Docker image: nexusstaticacr.azurecr.io/superbench/superbench:v0.9.0-cuda12.1
With CUDA 11.1 and 12.1 in MPI mode, I get the error below:
With CUDA 11.1 in local mode, I get the error below:
“Failed to create UCP worker.”
With CUDA 12.1 in local mode, I get a limit exceeded error.
Here is what I found after some research:
https://learn.microsoft.com/en-us/azure/virtual-machines/hbv4-series-overview
So, is there something that I am missing here in the sb configuration?