microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

Not able to test IB-TRAFFIC for NDR(H-100) Cluster #569

Status: Closed (NancyAgarwal013 closed this 10 months ago)

NancyAgarwal013 commented 10 months ago

What's the issue, what's expected?:

Hi Team,

I am trying to run the ib-traffic benchmark from my control node. Below is the YAML that I am using.

    ib-traffic:
      enable: true
      modes:
        - name: mpi
          proc_num: 8
      parameters:
        msg_size: 8388608
        ibdev: mlx5$LOCAL_RANK
        gpu_dev: $LOCAL_RANK
        numa_dev: $((LOCAL_RANK/2))
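To make the rank-dependent values in this config concrete, here is a minimal sketch that prints what `ibdev`, `gpu_dev`, and `numa_dev` would become for each of the 8 processes. It assumes the values are expanded with ordinary shell rules and that `LOCAL_RANK` runs from 0 to `proc_num - 1`; the function name `show_rank_mapping` is hypothetical and is not part of SuperBench.

```shell
# Sketch only: print what the three rank-dependent values from the config
# would expand to under ordinary shell rules, for ranks 0..proc_num-1.
# show_rank_mapping is a hypothetical helper, not a SuperBench command.
show_rank_mapping() {
  proc_num="$1"
  LOCAL_RANK=0
  while [ "$LOCAL_RANK" -lt "$proc_num" ]; do
    # Same expressions as in the YAML: mlx5$LOCAL_RANK, $LOCAL_RANK, $((LOCAL_RANK/2))
    echo "LOCAL_RANK=$LOCAL_RANK ibdev=mlx5$LOCAL_RANK gpu_dev=$LOCAL_RANK numa_dev=$((LOCAL_RANK/2))"
    LOCAL_RANK=$((LOCAL_RANK + 1))
  done
}

show_rank_mapping 8
```

Integer division pairs consecutive ranks onto one NUMA node (ranks 0 and 1 give `numa_dev=0`, ranks 6 and 7 give `numa_dev=3`). The device name expands exactly as written, for example `mlx53` for rank 3, so it may be worth comparing the result with the device names that `ibv_devices` reports on the nodes.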

SKU version: Linux (Ubuntu 22.04), Standard D2s v3 (2 vCPUs, 8 GiB memory)

The SuperBench Docker image I am using is nexusstaticacr.azurecr.io/superbench/superbench:v0.9.0-cuda12.1

With both the CUDA 11.1 and CUDA 12.1 images, I get the error below in MPI mode:

[Screenshot: mpi]

With CUDA 11.1 in local mode, I get the error below:

“Failed to create UCP worker.”

[Screenshot: ucp]

With CUDA 12.1 in local mode, I get a limit exceeded error:

[Screenshot: limit]

This is what I found after some research:

[Screenshot: ucx]

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv4-series-overview

So, is there something that I am missing here in the sb configuration?

abuccts commented 10 months ago

Hi @NancyAgarwal013, can you share the SKU of your IB nodes? Are they H100 SKUs on Azure or some on-premises DGX nodes?

To run the ib-traffic benchmark on H100 nodes, you will need the cuda12.1 Docker image and mpi mode, so the other logs are not useful.

For the cuda12.1 + mpi mode error, all I can tell from your screenshot is that "192.168.1.107:/root/sb-workspace/outputs/2023-10-23_11-44-05/" does not exist. Could you share all the commands you ran and the related logs (you can append |& tee xxx.log to capture all stdout/stderr in a file)?
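As a rough illustration of the log capture suggested above, the sketch below pipes both stdout and stderr of a command through `tee`. The function `fake_sb_run` is a hypothetical stand-in for the real command, and the log file name `sb-run.log` is an arbitrary choice. It uses `2>&1 | tee`, which is the portable spelling of bash's `|& tee`.

```shell
# Hypothetical stand-in for the real command: writes one line to stdout
# and one line to stderr, so both streams can be seen in the log.
fake_sb_run() {
  echo "stdout: benchmark started"
  echo "stderr: something failed" >&2
}

# Arbitrary log file name chosen for this sketch.
LOG_FILE="${LOG_FILE:-sb-run.log}"

# Merge stderr into stdout, then let tee print the combined stream
# and write a copy of it to the log file.
fake_sb_run 2>&1 | tee "$LOG_FILE"
```

In bash, the same idea applied to the command from this thread would be `sb run -f mix.ini -c ibtrafficTest.yaml |& tee sb-run.log`.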

NancyAgarwal013 commented 10 months ago

Hi @abuccts, the SKU is on-premises HGX H100 nodes.

I just deploy sb as usual and then run my YAML file with the command sb run -f mix.ini -c ibtrafficTest.yaml

Here, mix.ini is the Ansible file that contains the IPs of all the GPU nodes, and ibtrafficTest.yaml is the YAML file where the benchmark configuration is written.

cp5555 commented 10 months ago

This issue has been solved offline.