huggingface / blog

Public repo for HF blog posts
https://hf.co/blog

Bus error finetuning whisper model in multi GPU instances #1681

Open hitesh-ag1 opened 11 months ago

hitesh-ag1 commented 11 months ago

Hi, I am trying to fine-tune Whisper following the blog post here. Fine-tuning works great on a single GPU, but fails on multi-GPU instances: while executing trainer.train(), the multi-GPU instances crash with `Bus error (core dumped)`.

I am working on a g5.12xlarge multi-GPU instance on AWS (AMI ID: ami-071323fe2bf59945b, Ubuntu). I would appreciate any guidance or suggestions to resolve this issue.
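Not from the original report, but a common cause of `Bus error (core dumped)` in multi-GPU PyTorch training is running out of shared memory (`/dev/shm`), which DataLoader worker processes use to pass batches between processes. A minimal diagnostic sketch, assuming a Linux host; the 2 GiB threshold below is an illustrative guess, not a hard requirement:

```python
import shutil

# DataLoader workers exchange tensors through /dev/shm. If it is too
# small (common in default Docker containers), worker processes can die
# with "Bus error (core dumped)" once training starts.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")

# Illustrative rule of thumb only: keep a few GiB free across workers.
if free < 2 * 2**30:
    print("Warning: low shared memory. Consider enlarging /dev/shm "
          "(e.g. Docker's --shm-size) or, as a workaround, setting "
          "dataloader_num_workers=0 in the training arguments.")
```

Setting `dataloader_num_workers=0` avoids the shared-memory path entirely at the cost of slower data loading, which can help confirm whether shared memory is the culprit.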

hitesh-ag1 commented 11 months ago

cc @sanchit-gandhi Any help would be greatly appreciated!

dkrystki commented 11 months ago

Happening to me as well.

sanchit-gandhi commented 8 months ago

Hey @hitesh-ag1, sorry for the late reply here. Could you confirm that you're using exactly the same code as in the single-GPU fine-tuning case? Could you also provide the full stack trace for the error you're getting? For reference, there's a multi-GPU example for Whisper fine-tuning that you can check out in the Transformers library.