huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Launching DeepSpeed in mixed precision fp8 using HF Trainer is not working #34027

Open eljandoubi opened 1 month ago

eljandoubi commented 1 month ago

System Info

[Screenshot of system info attached, dated 2024-10-08]

acc_cfg.yml:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: auto
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 0.0.0.0
main_process_port: 0
main_training_function: main
mixed_precision: fp8
fp8_config:
  amax_compute_algorithm: max
  amax_history_length: 1024
  backend: TE
  fp8_format: HYBRID
  interval: 1
  margin: 0
  override_linear_precision: false
  use_autocast_during_eval: true
num_machines: 3
num_processes: 24
rdzv_backend: etcd-v2
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
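As a sanity check (not part of the original report), a minimal diagnostic script can confirm what Accelerate resolves from this config before handing off to DeepSpeed. The filename check_cfg.py is hypothetical; the attributes used are standard Accelerate API:

```python
# Diagnostic sketch (assumption, not from the report): verify what
# Accelerate reads from acc_cfg.yml.
# Run with: accelerate launch --config_file acc_cfg.yml check_cfg.py
from accelerate import Accelerator

accelerator = Accelerator()
print("mixed_precision:", accelerator.mixed_precision)    # expected: 'fp8'
print("distributed_type:", accelerator.distributed_type)  # expected: DEEPSPEED
print("deepspeed_plugin:", accelerator.state.deepspeed_plugin)
```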

Who can help?

No response

Reproduction

accelerate launch --config_file acc_cfg.yml train.py $TRAINING_ARGS

train.py is any training script that trains with transformers.Trainer; $TRAINING_ARGS are the TrainingArguments plus some paths to the data.
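For context, a minimal sketch of such a train.py. This is hypothetical, not the reporter's script; the model and dataset are placeholders:

```python
# Hypothetical minimal train.py; model and dataset are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
)

# Parse $TRAINING_ARGS from the command line into TrainingArguments.
parser = HfArgumentParser(TrainingArguments)
(training_args,) = parser.parse_args_into_dataclasses()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Placeholder dataset; the original report trains on image data.
dataset = load_dataset("glue", "mrpc", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"], truncation=True),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```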

Expected behavior

DeepSpeed does not pick up that the mixed precision is fp8; it switches to bf16 instead.
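One way to observe the reported fallback (an assumption about where to look, not from the report) is to inspect the DeepSpeed config dict that the Trainer's Accelerator assembles:

```python
# Diagnostic sketch: run inside train.py after the Trainer is built.
# Per the report, bf16 ends up enabled even though mixed_precision is fp8.
ds_plugin = trainer.accelerator.state.deepspeed_plugin
if ds_plugin is not None:
    cfg = ds_plugin.deepspeed_config  # plain dict of the final DS config
    print("bf16:", cfg.get("bf16"))
    print("fp16:", cfg.get("fp16"))
```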

Rocketknight1 commented 1 week ago

Maybe cc @SunMarc @muellerzr!