geekyGoku opened this issue 7 months ago
There is no hard and fast rule for setting the training config: you are free to experiment with different frameworks (FSDP/DeepSpeed) and benchmark the performance you get. Using Accelerate, switching between them should be trivial (a minimal sketch follows below): https://huggingface.co/docs/accelerate/usage_guides/deepspeed
This is the 'simplest' config you can set up, and it should be sufficient for single-GPU training with reasonable batch sizes. Should you want to scale to larger global batch sizes, you can try a multi-GPU setup (to enable data parallelism).
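For illustration, here is a minimal sketch of what switching to DeepSpeed looks like programmatically via Accelerate's DeepSpeedPlugin, rather than editing the YAML config. All the model/optimizer/dataloader objects below are toy placeholders, not the ones from run_distillation.py, and actually running the DeepSpeed variant requires deepspeed to be installed and the script to be started with accelerate launch:

```python
import torch
from accelerate import Accelerator, DeepSpeedPlugin

# A plain single-GPU setup would just be: accelerator = Accelerator()
# DeepSpeed variant: the same training code, only the plugin changes.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

# Toy stand-ins for the real model, optimizer, and dataloader:
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(32, 8), batch_size=4)

# prepare() wraps everything for whichever backend the plugin/config selects.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

The equivalent switch through the YAML config is just setting distributed_type to DEEPSPEED and adding a deepspeed_config block; the guide linked above covers that path.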
I tried FSDP and DeepSpeed by adding them to the accelerate config, but these methods do not support two models (a teacher model and a student model).
Interesting! We pass the teacher_model through accelerator.prepare, so I would have thought it would be possible:
https://github.com/huggingface/distil-whisper/blob/914dcdf3919552d5a3826a9d5db99b059ddcc16e/training/run_distillation.py#L1334-L1337
Since it's not, feel free to open an issue on the Accelerate repo so we can keep track of this incompatibility: https://github.com/huggingface/accelerate/issues/new?assignees=&labels=&projects=&template=bug-report.yml
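For context, here is a minimal sketch of the two-model pattern being discussed. The models are toy placeholders (not the script's exact code); the pattern works with the default Accelerate backend, and it is this step that reportedly breaks under FSDP/DeepSpeed:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Toy stand-ins for the distilled student and the frozen teacher:
student_model = torch.nn.Linear(8, 8)
teacher_model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-4)

# The teacher is inference-only: freeze it before prepare().
teacher_model.eval()
for p in teacher_model.parameters():
    p.requires_grad_(False)

# Both models go through the same prepare() call; only the student
# is paired with an optimizer. This mirrors the linked
# run_distillation.py lines.
student_model, teacher_model, optimizer = accelerator.prepare(
    student_model, teacher_model, optimizer
)
```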
Did you have any luck here @shuaijiang?
Hi,
How was the training performed? Was it done using FSDP or DeepSpeed?
Can you please share the Accelerate config used for training?
Thanks