microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

V0.10.0 Release Plan #559

Closed cp5555 closed 8 months ago

cp5555 commented 1 year ago

Release Manager

@cp5555

Endgame

Main Features

SuperBench Improvement

    • [x] Support monitoring for AMD GPUs (#518 and #601)
    • [x] Support ROCm 5.7 and ROCm 6.0 Dockerfiles (#587, #598, and #602)
    • [x] Add MSCCL support for NVIDIA GPUs (#584)
    • [x] Fix NUMA domains swap issue in NDv4 topology file (#592)
    • [x] Add NDv5 topology file (#597)
    • [x] Pin NCCL and NCCL-tests to 2.18.3 to fix hang issue in CUDA 12.2 (#599)

Micro-benchmark Improvement

    • [x] Add HPL random generator to gemm-flops with ROCm (#578)
    • [x] Add DirectXGPURenderFPS Benchmark to measure the FPS of rendering simple frames (#549)
    • [x] Add HWDecoderFPS benchmark to measure hardware decoder FPS (#560)
    • [x] Update Docker image for H100 support (#577)
    • [x] Update MLC version to 3.10 in CUDA/ROCm Dockerfiles (#562)
    • [x] Fix bug in GPU Burn test (#567)
    • [x] Support INT8 in cublaslt function (#574)
    • [x] Add hipBLASLt function benchmark (#576)
    • [x] Support cpu-gpu and gpu-cpu in ib-validation (#581)
    • [x] Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
    • [x] Support cpp implementation in distributed inference benchmark (#586 and #596)
    • [x] Add O2 option for gpu_copy ROCm build (#589)
    • [x] Support different hipblasLt data types in dist_inference (#590 and #603)
    • [x] Support in-place in NCCL/RCCL benchmark (#591)
    • [x] Support data type option in NCCL/RCCL benchmark (#595)
    • [x] Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs (#593)
    • [x] Update hipblaslt GEMM metric unit to tflops (#604)
    • [x] Support FP8 for hipblaslt benchmark (#605)
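
The metric-unit change above (#604) reports hipBLASLt GEMM throughput in TFLOPS. For an m×n×k GEMM the conversion is the standard 2·m·n·k floating-point operations over elapsed time; a minimal sketch (function name and sample sizes are illustrative, not from SuperBench):

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_s: float) -> float:
    # A dense GEMM performs 2*m*n*k floating-point operations
    # (one multiply and one add per inner-product term).
    flops = 2.0 * m * n * k
    return flops / elapsed_s / 1e12


# Example: a 1024x1024x1024 GEMM finishing in 1 ms
print(gemm_tflops(1024, 1024, 1024, 1e-3))
```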

Model Benchmark Improvement

    • [x] Change torch.distributed.launch to torchrun (#556)
    • [x] Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582 and #600)
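
The launcher migration above (#556) replaces the deprecated `torch.distributed.launch` module with `torchrun`. The commands below illustrate the change for a single 8-GPU node (script name and GPU count are placeholders):

```shell
# Before: deprecated launcher, passes --local_rank as a script argument
python -m torch.distributed.launch --nproc_per_node=8 train.py

# After: torchrun exposes LOCAL_RANK via the environment instead
torchrun --nproc_per_node=8 train.py
```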

Result Analysis

    • [x] Support baseline generation from multiple nodes (#575)
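
Baseline generation from multiple nodes (#575) combines the same metric reported by several nodes into one reference value. A minimal sketch of one plausible aggregation (mean across nodes; the data layout and function are illustrative, not SuperBench's actual implementation):

```python
from statistics import mean


def aggregate_baseline(node_results: dict[str, dict[str, float]]) -> dict[str, float]:
    # node_results maps node name -> {metric name: value}.
    # Collect every metric seen on any node, then average across
    # the nodes that reported it.
    metrics: dict[str, list[float]] = {}
    for results in node_results.values():
        for metric, value in results.items():
            metrics.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in metrics.items()}


# Example: two nodes reporting the same bandwidth metric
baseline = aggregate_baseline(
    {"node0": {"nccl_bw": 180.0}, "node1": {"nccl_bw": 190.0}}
)
print(baseline)
```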

Backlog

Micro-benchmark Improvement

  1. Support cuDNN Backend API in cudnn-function.

Model Benchmark Improvement

  1. Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
  2. Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
  3. Support more TensorRT parameters (Related to #366)