huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Using the original DeepSpeed JSON config with bf16 raises `RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half` #3197

Open

PMPBinZhang opened this issue 4 weeks ago

System Info

- `Accelerate` version: 1.0.0
- Platform: Linux-6.8.0-47-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/user/anaconda3/envs/accelerate_multi/bin/accelerate
- Python version: 3.11.0
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 62.55 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:

Information

Tasks

Reproduction

  1. The accelerate config is as follows:

```yaml
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: /home/user/work/screenplays_sft/ds_zero3_cpu_offload.config
  zero3_init_flag: true
  deepspeed_multinode_launcher: standard
main_process_ip: 192.168.252.20
main_process_port: 25253
distributed_type: DEEPSPEED
downcast_bf16: true
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
  2. `ds_zero3_cpu_offload.config` is as follows:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": false,
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "comms_logger": {
    "enabled": true,
    "verbose": true,
    "prof_all": true,
    "debug": false
  }
}
```
  3. The launch command is as follows:

```shell
accelerate launch --config_file multi_nodes_single_gpu_deepspeed_zero3_cfg_file.yaml sft_trainer.py --log_level info --bf16 True
```
  4. The 4-bit load code is as follows:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```
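For context, the `RuntimeError` in the title is a plain mixed-dtype matmul failure: one operand ends up in bfloat16 and the other in float16 (Half). A minimal standalone sketch, independent of DeepSpeed, that triggers the same class of error (the exact message wording varies by device and PyTorch version):

```python
import torch

# Two matrices in different low-precision dtypes: bfloat16 vs. float16 (Half).
a = torch.randn(2, 3).to(torch.bfloat16)
b = torch.randn(3, 4).to(torch.float16)

try:
    a @ b  # mixed-dtype matmul is rejected by PyTorch
except RuntimeError as e:
    # e.g. "expected mat1 and mat2 to have the same dtype" (wording varies)
    print(type(e).__name__)

# Casting both operands to the same dtype resolves it:
c = a @ b.to(torch.bfloat16)
print(c.dtype)  # torch.bfloat16
```

So the error means that somewhere in the forward pass, part of the model (or its inputs) was cast to fp16 while the rest stayed in bf16.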

  5. And if I configure DeepSpeed directly in the accelerate config instead, then I can use bf16. That config is as follows:

```yaml
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: 192.168.252.20
main_process_port: 25253
main_training_function: main
num_machines: 2
num_processes: 2
mixed_precision: bf16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```
  6. If I add `mixed_precision: bf16` to the config file from step 1, like this:

```yaml
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_config_file: /home/user/work/screenplays_sft/ds_zero3_cpu_offload.config
  zero3_init_flag: true
  deepspeed_multinode_launcher: standard
main_process_ip: 192.168.252.20
main_process_port: 25253
distributed_type: DEEPSPEED
downcast_bf16: true
mixed_precision: bf16
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

then I get the error:

```
ValueError: When using deepspeed_config_file, the following accelerate config variables will be ignored: ['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device', 'offload_param_nvme_path', 'offload_optimizer_nvme_path', 'zero3_save_16bit_model', 'mixed_precision'].
```
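The ValueError in step 6 reflects that when an external `deepspeed_config_file` is supplied, Accelerate refuses to also accept precision and ZeRO keys from the accelerate config, since the JSON file is then the single source of truth. A simplified illustration of that validation rule (not Accelerate's actual implementation; the function name here is hypothetical, and the key list is copied from the error message above):

```python
# Keys that must come from the DeepSpeed JSON when a config file is supplied.
# (List copied from the ValueError above.)
IGNORED_WITH_CONFIG_FILE = [
    "gradient_accumulation_steps", "gradient_clipping", "zero_stage",
    "offload_optimizer_device", "offload_param_device",
    "offload_param_nvme_path", "offload_optimizer_nvme_path",
    "zero3_save_16bit_model", "mixed_precision",
]

def check_deepspeed_keys(config: dict) -> None:
    """Hypothetical sketch of the conflict check behind the ValueError."""
    ds = config.get("deepspeed_config", {})
    if "deepspeed_config_file" not in ds:
        return  # no external JSON file: accelerate-config keys are authoritative
    clashing = [k for k in IGNORED_WITH_CONFIG_FILE if k in config or k in ds]
    if clashing:
        raise ValueError(
            "When using deepspeed_config_file, the following accelerate "
            f"config variables will be ignored: {clashing}"
        )

# The step-6 config sets both a config file and mixed_precision -> rejected:
try:
    check_deepspeed_keys({
        "deepspeed_config": {"deepspeed_config_file": "ds_zero3_cpu_offload.config"},
        "mixed_precision": "bf16",
    })
except ValueError as e:
    print(e)
```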

Could you please tell me the reason? Thank you very much.

Expected behavior

Training should run in bf16 when using a native DeepSpeed JSON config file.
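(For reference: since `mixed_precision` is ignored alongside `deepspeed_config_file`, the `"auto"` value in the JSON's `bf16` section may have nothing to resolve against and fall back to disabled, leaving part of the model in fp16. One thing to try — an assumption on my part, not a confirmed fix — is enabling bf16 explicitly in the JSON instead of `"auto"`:)

```json
{
  "bf16": {
    "enabled": true
  }
}
```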