microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

V0.10.0 Release Plan #559

Closed cp5555 closed 8 months ago

cp5555 commented 1 year ago

Release Manager

@cp5555

Endgame

Main Features

SuperBench Improvement

    • [x] Support monitoring for AMD GPUs (#518 and #601)
    • [x] Support ROCm 5.7 and ROCm 6.0 Dockerfiles (#587, #598, and #602)
    • [x] Add MSCCL support for NVIDIA GPUs (#584)
    • [x] Fix NUMA domains swap issue in NDv4 topology file (#592)
    • [x] Add NDv5 topology file (#597)
    • [x] Pin NCCL and NCCL-tests to 2.18.3 to fix hang issue in CUDA 12.2 (#599)

Micro-benchmark Improvement

    • [x] Add HPL random generator to gemm-flops with ROCm (#578)
    • [x] Add DirectXGPURenderFPS Benchmark to measure the FPS of rendering simple frames (#549)
    • [x] Add HWDecoderFPS benchmark to measure hardware decoder FPS (#560)
    • [x] Update Docker image for H100 support (#577)
    • [x] Update MLC version to 3.10 in CUDA/ROCm Dockerfiles (#562)
    • [x] Fix bug in GPU Burn test (#567)
    • [x] Support INT8 in cublaslt function (#574)
    • [x] Add hipBLASLt function benchmark (#576)
    • [x] Support cpu-gpu and gpu-cpu in ib-validation (#581)
    • [x] Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
    • [x] Support cpp implementation in distributed inference benchmark (#586 and #596)
    • [x] Add O2 option for gpu_copy ROCm build (#589)
    • [x] Support different hipblasLt data types in dist_inference (#590 and #603)
    • [x] Support in-place in NCCL/RCCL benchmark (#591)
    • [x] Support data type option in NCCL/RCCL benchmark (#595)
    • [x] Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs (#593)
    • [x] Update hipblaslt GEMM metric unit to tflops (#604)
    • [x] Support FP8 for hipblaslt benchmark (#605)
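
The metric-unit change above (#604) reports hipBLASLt GEMM throughput in TFLOPS. For an m×n×k GEMM the conversion is the standard 2·m·n·k floating-point operations over elapsed time; a minimal sketch (function name and sample sizes are illustrative, not from SuperBench):

```python
def gemm_tflops(m: int, n: int, k: int, elapsed_s: float) -> float:
    # A dense GEMM performs 2*m*n*k floating-point operations
    # (one multiply and one add per inner-product term).
    flops = 2.0 * m * n * k
    return flops / elapsed_s / 1e12


# Example: a 1024x1024x1024 GEMM finishing in 1 ms
print(gemm_tflops(1024, 1024, 1024, 1e-3))
```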

Model Benchmark Improvement

    • [x] Change torch.distributed.launch to torchrun (#556)
    • [x] Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark (#582 and #600)
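
The launcher migration above (#556) replaces the deprecated `torch.distributed.launch` module with `torchrun`. The commands below illustrate the change for a single 8-GPU node (script name and GPU count are placeholders):

```shell
# Before: deprecated launcher, passes --local_rank as a script argument
python -m torch.distributed.launch --nproc_per_node=8 train.py

# After: torchrun exposes LOCAL_RANK via the environment instead
torchrun --nproc_per_node=8 train.py
```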

Result Analysis

    • [x] Support baseline generation from multiple nodes (#575)
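
Baseline generation from multiple nodes (#575) combines the same metric reported by several nodes into one reference value. A minimal sketch of one plausible aggregation (mean across nodes; the data layout and function are illustrative, not SuperBench's actual implementation):

```python
from statistics import mean


def aggregate_baseline(node_results: dict[str, dict[str, float]]) -> dict[str, float]:
    # node_results maps node name -> {metric name: value}.
    # Collect every metric seen on any node, then average across
    # the nodes that reported it.
    metrics: dict[str, list[float]] = {}
    for results in node_results.values():
        for metric, value in results.items():
            metrics.setdefault(metric, []).append(value)
    return {metric: mean(values) for metric, values in metrics.items()}


# Example: two nodes reporting the same bandwidth metric
baseline = aggregate_baseline(
    {"node0": {"nccl_bw": 180.0}, "node1": {"nccl_bw": 190.0}}
)
print(baseline)
```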

Backlog

Micro-benchmark Improvement

  1. Support cuDNN Backend API in cudnn-function.

Model Benchmark Improvement

  1. Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
  2. Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
  3. Support more TensorRT parameters (Related to #366)