Open bumbawumba opened 8 months ago
Describe the bug When running DeepSpeed-Chat RLHF (step 3) on ROCm/AMD, training crashes if I use bf16. fp16 works on AMD, and both fp16 and bf16 work on NVIDIA. This seems to be because enable_bf16 is never set in op_builder/builder.py when using pytorch-rocm.
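The suspected failure mode can be sketched with a toy model. This is not the code in op_builder/builder.py: the class, its method names, the capability threshold, and the flag string are all hypothetical, chosen only to show how a flag that is set as a side effect of a CUDA-only step would keep its default on a ROCm build.

```python
from dataclasses import dataclass


@dataclass
class ToyBuilder:
    """Hypothetical stand-in for an op builder; NOT DeepSpeed's real class."""
    is_rocm: bool
    compute_capability: int = 80  # assumed value, only used on the CUDA path
    enable_bf16: bool = False     # default stays False unless something sets it

    def cuda_arch_args(self):
        # CUDA-only step: bf16 is switched on here as a side effect.
        if self.compute_capability >= 80:
            self.enable_bf16 = True
        cc = self.compute_capability
        return [f"-gencode=arch=compute_{cc},code=sm_{cc}"]

    def compile_args(self):
        args = []
        if not self.is_rocm:
            args += self.cuda_arch_args()
        # On ROCm the branch above is skipped, so enable_bf16 is still False
        # and the (illustrative) bf16 define is never added.
        if self.enable_bf16:
            args.append("-DBF16_AVAILABLE")
        return args


print(ToyBuilder(is_rocm=False).compile_args())
print(ToyBuilder(is_rocm=True).compile_args())
```

In this sketch the CUDA build gets the bf16 define and the ROCm build gets no extra arguments at all. If the real builder follows a similar pattern, kernels compiled for ROCm would lack bf16 support even when the hardware and pytorch-rocm provide it.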
To Reproduce
conda activate myenv  # package list attached
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt
cd training/step3_rlhf_finetuning
PYTHONPATH=../.. deepspeed --num_gpus 1 main.py \
    --actor_model_name_or_path facebook/opt-350m \
    --critic_model_name_or_path facebook/opt-350m \
    --actor_zero_stage 3 \
    --critic_zero_stage 3 \
    --num_padding_at_beginning 1 \
    --gradient_accumulation_steps 2 \
    --deepspeed \
    --actor_lora_dim 128 \
    --enable_hybrid_engine \
    --actor_gradient_checkpointing \
    --actor_dropout 0.0 \
    --dtype bf16
Expected behavior Training runs without crashing, as it does when --dtype bf16 is omitted.
ds_report output Attached.
Screenshots Output from run attached
System info (please complete the following information):
Launcher context deepspeed launcher
Docker context Yes but cannot share image.
Additional context See attached ds_report, conda package list, and output file. deepspeed_bf16_rocm.log ds_report.txt package_list.txt