microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

Unable to run on Nvidia A100 #625

Closed mittaltarkik closed 2 months ago

mittaltarkik commented 4 months ago

What's the issue, what's expected?: I am facing an issue while running SB on an Nvidia A100.

How to reproduce it?:
VM: Standard ND96asr v4 (96 vCPUs, 900 GiB memory)
OS: Linux (Ubuntu 22.04)
CUDA: cuda_12.4.0_550.54
SB version: 10, Docker file: cuda12.2
Attachment: error_logs.txt

Log message or snapshot?: Log file attached

Additional information: Please help us with the correct configuration.

avnf commented 4 months ago

Hello. You ran the command sb deploy --host-list localhost -i superbench/superbench:v0.10.0-cuda12.4, which points to a Docker image that has not been released yet. The most recent available image is superbench/superbench:v0.10.0-cuda12.2, according to the https://hub.docker.com/r/superbench/superbench/tags page.
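
For reference, a minimal sketch of the corrected invocation, assuming the image tag is the only thing that needs to change and the other flags stay as in the original command. The variable names IMAGE and DEPLOY_CMD are illustrative and not part of SuperBench.

```shell
#!/bin/sh
# Released tag named in the comment above; the cuda12.4 tag was not
# published on Docker Hub at the time of this thread.
IMAGE="superbench/superbench:v0.10.0-cuda12.2"

# Same flags as the original command, with only the image tag changed.
DEPLOY_CMD="sb deploy --host-list localhost -i $IMAGE"
echo "$DEPLOY_CMD"

# Run the deployment only if the sb CLI is installed on this machine.
if command -v sb >/dev/null 2>&1; then
  $DEPLOY_CMD
fi
```

Checking the tags page before deploying avoids the same failure if a different version is needed later.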