huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Does this use FSDP or Deepspeed? Req. for accelerate config #40

Open geekyGoku opened 7 months ago

geekyGoku commented 7 months ago

Hi,

How was the training performed? Was it done using FSDP or DeepSpeed?

Could you please provide the Accelerate config used for training?

Thanks

sanchit-gandhi commented 7 months ago

There is no hard and fast rule for setting the training config: you are free to experiment with different frameworks (FSDP/DeepSpeed) and benchmark the performance that you get. Using accelerate, switching between them should be trivial: https://huggingface.co/docs/accelerate/usage_guides/deepspeed
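For example, switching the same setup over to DeepSpeed is mostly a matter of answering the DeepSpeed questions in `accelerate config`. A rough sketch of what the resulting YAML might look like (illustrative values for a single 8-GPU machine, not the config used for the distil-whisper experiments):

```yaml
# Hypothetical accelerate config for DeepSpeed ZeRO stage 2 on one 8-GPU machine.
# Values are illustrative; adjust num_processes, zero_stage, offloading, etc. to your hardware.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```

The training script is launched the same way in either case (`accelerate launch run_distillation.py ...`); only the config changes.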

The 'base' accelerate config used for the PyTorch experiments was:

In which compute environment are you running? **This machine**
Which type of machine are you using? **No distributed training**
Do you want to run your training on CPU only (even if a GPU / Apple Silicon / Ascend NPU device is available)? [yes/NO]: **No**
Do you wish to optimize your script with torch dynamo? [yes/NO]: **No**
Do you want to use DeepSpeed? [yes/NO]: **No**
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]: **0**
Do you wish to use FP16 or BF16 (mixed precision)? **bf16**

This is the 'simplest' config you can set up, and should be sufficient for single-GPU training with reasonable batch sizes. Should you want to scale to larger global batch sizes, you can try a multi-GPU set-up to enable data parallelism (a sketch of such a config is included below the base config).

config yaml

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
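For the multi-GPU case, only a few fields change relative to the base config. A hedged sketch, assuming 8 GPUs on a single machine (not the exact config used for the experiments):

```yaml
# Hypothetical multi-GPU (DDP) accelerate config; adjust num_processes/gpu_ids to your machine.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```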
shuaijiang commented 7 months ago

I tried FSDP and DeepSpeed by adding them to the accelerate config, but these methods do not support two models (a teacher model and a student model).
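For context, the FSDP section added to the accelerate config looked roughly like the sketch below (values are illustrative, based on accelerate's standard FSDP plugin options; exact field names can vary slightly between accelerate versions):

```yaml
# Sketch of an FSDP accelerate config; illustrative values only.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 8
```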

sanchit-gandhi commented 7 months ago

Interesting! We pass the teacher_model through accelerate.prepare, so I would have thought it would be possible: https://github.com/huggingface/distil-whisper/blob/914dcdf3919552d5a3826a9d5db99b059ddcc16e/training/run_distillation.py#L1334-L1337

Since it's not, feel free to open an issue on the Accelerate repo so we can keep track of this incompatibility: https://github.com/huggingface/accelerate/issues/new?assignees=&labels=&projects=&template=bug-report.yml

sanchit-gandhi commented 6 months ago

Did you have any luck here @shuaijiang?