Open BedirT opened 3 months ago
Are you trying to fine-tune Mixtral with QLoRA and FSDP on a g5.48xlarge?
Yes, I tried a 24xlarge initially but upped it since I was getting OOMs.
I still couldn't resolve this; any ideas on how to approach it? The HF docs are pretty thin on SageMaker + model-parallel distributed training.
Also, accelerate still doesn't seem to have a model-parallel option.
Hi there,
Thanks for the scripts and posts! I'm interested in fine-tuning Mixtral 8x7B on SageMaker. The task I have requires around an 8k token context length.
I tried running training by following this tutorial: https://solano-todeschini.medium.com/fine-tune-mixtral-8x7b-on-aws-sagemaker-and-deploy-to-runpod-6bbb79981d7b#31b4, but using this updated script instead: https://www.philschmid.de/sagemaker-train-evalaute-llms-2024.
The first post uses an `ml.g5.24xlarge` instance, which, funnily enough, has no sharding or `fsdp` parameter set up. When I try running the same setup with an increased context length, I get an OOM. I went up to an `ml.g5.48xlarge` instance with 192 GB of VRAM, but nothing changed. I also looked into this: https://www.philschmid.de/sagemaker-fsdp-gpt and tried setting `fsdp` up by adding `'fsdp': '"full_shard auto_wrap"'`.
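For reference, here is a minimal sketch of how that `fsdp` value sits in the hyperparameters dict passed to the SageMaker estimator. Only the `fsdp` entry comes from the posts; the other key names are assumptions for illustration, not verified against the actual training script:

```python
# Sketch of the hyperparameters dict handed to the SageMaker HuggingFace
# estimator. Only `fsdp` is taken from the tutorial; the other keys are
# assumed, illustrative names.
hyperparameters = {
    "model_id": "mistralai/Mixtral-8x7B-v0.1",  # assumed model id
    "max_seq_len": 8192,                        # the ~8k context I need
    # Nested quotes matter: SageMaker passes hyperparameters through a
    # shell, so the inner double quotes keep both FSDP options together
    # as a single argument to the training script.
    "fsdp": '"full_shard auto_wrap"',
}

# After the outer quoting layer is stripped, the script should see two
# FSDP options in one string:
fsdp_value = hyperparameters["fsdp"].strip('"')
print(fsdp_value.split())  # -> ['full_shard', 'auto_wrap']
```

If the quoting is wrong (e.g. a single layer of quotes), the two options can arrive as separate, broken arguments, which silently disables sharding.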
According to this chart, https://github.com/hiyouga/LLaMA-Factory#hardware-requirement, the estimated memory requirement should be around 30-60 GB for the model? How much does the context length affect memory usage?
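For what it's worth, a rough back-of-envelope calculation (my own assumptions, not measurements) suggests why going to the 48xlarge doesn't help without sharding: its 192 GB is split across 8 A10G GPUs at 24 GB each, and without FSDP every GPU holds the full quantized model:

```python
# Back-of-envelope memory estimate for QLoRA on Mixtral 8x7B.
# All numbers here are rough assumptions, not measurements.
params = 46.7e9                   # ~46.7B total parameters in Mixtral 8x7B

weights_gb = params * 0.5 / 1e9   # 4-bit NF4 quantization ~ 0.5 bytes/param
per_gpu_gb = 24                   # one A10G on a g5 instance

print(f"4-bit base weights: {weights_gb:.2f} GB")

# Without sharding (plain DDP), EVERY GPU holds the full quantized model,
# so nearly all of each 24 GB A10G is consumed before any activations,
# which themselves grow with sequence length (hence the OOM at 8k tokens).
headroom_gb = per_gpu_gb - weights_gb
print(f"Per-GPU headroom before activations: {headroom_gb:.2f} GB")
```

That would mean the per-GPU limit, not the total 192 GB, is the binding constraint, and FSDP (which shards the weights across GPUs) is what actually buys headroom for longer contexts.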
I also saw that here https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/sagemaker-notebook.ipynb you are using a much larger instance, but I'm not sure whether that's just because the example is older.
PS: I am using pretty much the same parameters as in the post, except for `max_seq_len`, and with the addition of `fsdp`.
Any insight would be greatly appreciated.