[BUG] Failure when trying to use bf16 for RLHF on ROCM -- missing qkv_gemm op.

Describe the bug
When running DeepSpeed-Chat RLHF on ROCm/AMD, training crashes if I use bf16 (fp16 works on AMD; both work on NVIDIA). This seems to be because enable_bf16 is never set in op_builder/builder.py when using pytorch-rocm.
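A quick way to confirm which backend the installed PyTorch build targets, and whether PyTorch itself reports bf16 as usable, is sketched below. It is only a diagnostic aid: it uses `torch.version.hip`, `torch.version.cuda`, and `torch.cuda.is_bf16_supported()`, and it does not inspect DeepSpeed's op builder, so it cannot confirm the enable_bf16 theory by itself. The helper name `describe_backend` is made up for this example.

```python
import importlib.util


def describe_backend():
    """Report which GPU backend this PyTorch build targets and whether bf16 is usable."""
    # Avoid a hard failure when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    gpu_available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "cuda_version": getattr(torch.version, "cuda", None),
        "gpu_available": gpu_available,
        # Only query bf16 support when a device is actually visible.
        "bf16_supported": gpu_available and torch.cuda.is_bf16_supported(),
    }


if __name__ == "__main__":
    for key, value in describe_backend().items():
        print(f"{key}: {value}")
```

If `hip_version` is set and `bf16_supported` is true while the DeepSpeed run still fails, that would be consistent with the problem sitting in the DeepSpeed build path rather than in PyTorch.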

To Reproduce

    conda activate myenv    [package list attached]
    git clone https://github.com/microsoft/DeepSpeedExamples.git
    cd DeepSpeedExamples/applications/DeepSpeed-Chat/
    pip install -r requirements.txt
    cd training/step3_rlhf_finetuning
    PYTHONPATH=../.. deepspeed --num_gpus 1 main.py \
        --actor_model_name_or_path facebook/opt-350m \
        --critic_model_name_or_path facebook/opt-350m \
        --actor_zero_stage 3 \
        --critic_zero_stage 3 \
        --num_padding_at_beginning 1 \
        --gradient_accumulation_steps 2 \
        --deepspeed \
        --actor_lora_dim 128 \
        --enable_hybrid_engine \
        --actor_gradient_checkpointing \
        --actor_dropout 0.0 \
        --dtype bf16

Expected behavior
Training runs without crashing, as it does when --dtype bf16 is omitted.
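To compare the failing and working runs side by side, the two command lines can be generated from one argument list so that the only difference is the `--dtype bf16` flag. This is a minimal sketch: the arguments are copied from the reproduction command, while `BASE_ARGS` and `build_command` are hypothetical names introduced here. It only prints the commands and does not launch anything.

```python
import shlex

# Arguments shared by both runs, taken from the reproduction command.
BASE_ARGS = [
    "deepspeed", "--num_gpus", "1", "main.py",
    "--actor_model_name_or_path", "facebook/opt-350m",
    "--critic_model_name_or_path", "facebook/opt-350m",
    "--actor_zero_stage", "3",
    "--critic_zero_stage", "3",
    "--num_padding_at_beginning", "1",
    "--gradient_accumulation_steps", "2",
    "--deepspeed",
    "--actor_lora_dim", "128",
    "--enable_hybrid_engine",
    "--actor_gradient_checkpointing",
    "--actor_dropout", "0.0",
]


def build_command(dtype=None):
    """Return the launcher argument list, appending --dtype only when one is given."""
    cmd = list(BASE_ARGS)
    if dtype is not None:
        cmd += ["--dtype", dtype]
    return cmd


if __name__ == "__main__":
    # Failing case on ROCm, as reported.
    print("PYTHONPATH=../.. " + shlex.join(build_command("bf16")))
    # Working case: same arguments with the --dtype flag omitted.
    print("PYTHONPATH=../.. " + shlex.join(build_command()))
```

Both printed commands are meant to be run from training/step3_rlhf_finetuning, as in the reproduction steps.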

ds_report output
Attached.

Screenshots
Output from the run is attached.

System info (please complete the following information):

Ubuntu 20.04, ROCm 5.4.2 in a Docker container
x16 AMD MI250
Python 3.11

Launcher context
deepspeed launcher

Docker context
Yes, but I cannot share the image.

Additional context
See the attached ds_report, conda package list, and output file.
Attachments: deepspeed_bf16_rocm.log, ds_report.txt, package_list.txt

microsoft/DeepSpeed issue #4717