openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

[ROCm] Add script to run multi gpu tests #14364

Open hsharsha opened 3 days ago

hsharsha commented 3 days ago

This ROCm-specific script, housed under build_tools/rocm, runs the distributed tests listed below. They require at least 4 GPUs and are currently skipped in CI because of tag selection: they are tagged either as manual or with oss.

//xla/tests:collective_ops_e2e_test_gpu_amd_any 
//xla/tests:collective_ops_test_gpu_amd_any 
//xla/tests:replicated_io_feed_test_gpu_amd_any 
//xla/tools/multihost_hlo_runner:functional_hlo_runner_test_gpu_amd_any 
//xla/pjrt/distributed:topology_util_test 
//xla/pjrt/distributed:client_server_test

These tests are also run without --run_under=//tools/ci_build/gpu_build:parallel_gpu_execute. That bazel wrapper locks each test to an individual GPU, which makes multi-GPU tests impossible to run.
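For illustration, here is a rough sketch of what such an invocation might look like. It is not the script from this PR. The target list is taken from the description above. The DRY_RUN variable and the choice of --test_output=errors and --local_test_jobs=1 (to run one test at a time so each can see every GPU) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: not the script added by this PR. Flag choices are assumptions.
set -eu

# Multi-GPU targets named in the PR description.
TARGETS="//xla/tests:collective_ops_e2e_test_gpu_amd_any \
//xla/tests:collective_ops_test_gpu_amd_any \
//xla/tests:replicated_io_feed_test_gpu_amd_any \
//xla/tools/multihost_hlo_runner:functional_hlo_runner_test_gpu_amd_any \
//xla/pjrt/distributed:topology_util_test \
//xla/pjrt/distributed:client_server_test"

# No --run_under wrapper here, so nothing pins a test to a single GPU.
# --local_test_jobs=1 (assumed) keeps tests from competing for the same GPUs.
CMD="bazel test --test_output=errors --local_test_jobs=1 $TARGETS"

# DRY_RUN is a hypothetical switch for this sketch; by default only print.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s\n' "$CMD"
else
  # Word splitting is intentional: bazel labels contain no spaces or globs.
  # shellcheck disable=SC2086
  $CMD
fi
```

Naming the targets explicitly matters because bazel leaves tests tagged manual out of wildcard patterns such as //xla/...; they run only when listed by name.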

Eventually we would like to enable these tests in a separate pipeline to get better test coverage.