System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I'm training Llama-2 models with FSDP on a 4xH100 (80GB) server and a 4xA100 (80GB) server.
Full fine-tuning of the 7B model and PEFT of the 70B model behave as expected (the H100 server is faster than the A100 server), but for full fine-tuning of the 70B model, training on the H100 server takes 1.5-2x longer than on the A100 server.
Here is the total training time for the A100 and H100 servers.
Although there are many differences between the two servers, it is strange that only 70B full fine-tuning shows this problem. Do you have any idea what could cause it?
My training code is below
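For context, a minimal Accelerate-based full fine-tuning loop of this kind might look like the sketch below; the checkpoint name, dummy data, and hyperparameters are placeholders rather than the actual values used in these runs, and FSDP itself is picked up from the config passed to `accelerate launch`.

```python
# Illustrative sketch only: placeholder checkpoint, dummy data, and hyperparameters.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

MODEL_NAME = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint name

accelerator = Accelerator()  # FSDP settings come from the accelerate launch config
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Dummy data standing in for the real fine-tuning corpus.
tokenizer.pad_token = tokenizer.eos_token
input_ids = tokenizer(["example text"] * 64, padding="max_length",
                      max_length=128, return_tensors="pt")["input_ids"]
loader = DataLoader(input_ids, batch_size=1,
                    collate_fn=lambda rows: {"input_ids": torch.stack(rows),
                                             "labels": torch.stack(rows)})

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for batch in loader:
    loss = model(**batch).loss  # causal-LM loss with labels == input_ids
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```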
and the training scripts are below
FSDP configuration:
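For reference, an equivalent FSDP setup can also be expressed directly in Python via Accelerate's `FullyShardedDataParallelPlugin`; the values below are illustrative rather than the exact configuration used in these runs.

```python
# Roughly how FSDP options can be set in Python instead of the accelerate config file.
# The specific values are illustrative only.
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullStateDictConfig,
    FullOptimStateDictConfig,
)
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    limit_all_gathers=True,   # throttle all-gathers to cap peak memory
    use_orig_params=True,     # keep original parameter objects (e.g. for torch.compile)
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```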
Expected behavior
H100 should be faster than A100 for full fine-tuning of the 70B model.