Open vdabravolski opened 10 months ago
cc @pacman100
FSDP support for fp8 is experimental and is on NVIDIA's roadmap (with currently no public prototype yet). We need to wait on them.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
I'm trying to launch multi-node, multi-GPU Llama-2 continued pretraining. My training script uses Accelerate to set up the distributed environment and the HF Transformers Trainer to execute the training loop. I'd like to use FP8 precision with the FSDP plugin, but I'm running into issues.
Below are some details on how to reproduce the issue. In my example I omitted some custom code which distributes the tasks and prepares the data, to keep it simple. Let me know if any key details are missing.
I start the training script with the following command line, which runs on each machine in the multi-node environment:
where `train_module.train()` is a custom wrapper on top of the HuggingFace `Trainer` class with minimal changes to it.
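Roughly, the wrapper does something like the sketch below (simplified, with illustrative names, model ID, and paths; the real code also contains our task-distribution and data-preparation logic):

```python
# Simplified sketch of what train_module.train() does; names and paths are
# illustrative. FSDP and mixed-precision settings come from the
# `accelerate launch` flags / accelerate config, not from this code.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)


def train(train_dataset):
    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID
    model = AutoModelForCausalLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    args = TrainingArguments(
        output_dir="./llama2-cpt",          # illustrative output path
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        num_train_epochs=1,
        logging_steps=10,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,        # prepared by the omitted code
        tokenizer=tokenizer,
    )
    trainer.train()
```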
When running my script with `--mixed_precision=bf16`, everything works as expected: the model is successfully sharded across GPUs, training starts, and the loss decreases. However, when passing `--mixed_precision=fp8`, I'm getting the following error:

Looking into the stack trace, I can see that while the Accelerate CLI supports `--mixed_precision=fp8` (reference), the FSDP plugin seems to only support the "no", "fp16", or "bf16" options (reference).
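For completeness, the working bf16 case can also be expressed by passing an explicit FSDP plugin to the Accelerator. This is a simplified sketch (assuming a recent Accelerate version), not my exact setup, but it shows that the plugin's mixed-precision policy is built from torch dtypes, which matches the fp16/bf16-only behavior above:

```python
# Sketch: the bf16 configuration that works, expressed with an explicit FSDP
# plugin. Assumes a recent Accelerate version; this is not my exact setup.
# The plugin's MixedPrecision policy is built from torch dtypes, which is
# consistent with the "no"/"fp16"/"bf16"-only behavior described above.
import torch
from torch.distributed.fsdp import MixedPrecision
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    mixed_precision_policy=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
)

# Works with bf16; the equivalent for fp8 is what this issue is asking about.
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)
```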
Can you please confirm that my understanding is correct, i.e. that Accelerate supports FP8 only without ZeRO-3-style sharding frameworks (e.g. FSDP or DeepSpeed)? If so, does the Accelerate team have a timeline for adding FP8 support to the FSDP plugin?
Expected behavior
I expect both bf16 and fp8 to work similarly.