Open BedirT opened 3 months ago
Are you trying to fine-tune Mixtral with QLoRA and FSDP on a g5.48xlarge?
Yes, I tried a 24xlarge initially but upped it since I was getting OOMs.
I still couldn't resolve this; any ideas on how to approach it? The HF docs are pretty thin on SageMaker + model-parallel distributed training.
Also, accelerate still doesn't seem to have a model-parallel option.
Hi there,
Thanks for the scripts and posts! I'm interested in fine-tuning Mixtral 8x7B on SageMaker. The task I have requires around an 8k token context length.
I tried running training by following this tutorial: https://solano-todeschini.medium.com/fine-tune-mixtral-8x7b-on-aws-sagemaker-and-deploy-to-runpod-6bbb79981d7b#31b4, but using this updated script instead: https://www.philschmid.de/sagemaker-train-evalaute-llms-2024.
The first post uses an `ml.g5.24xlarge` instance, which, funnily enough, has no sharding or `fsdp` parameter set up. When I try running the same setup with an increased context length, I get an OOM. I went up to an `ml.g5.48xlarge` instance with 192 GB of VRAM, but nothing changed. I also looked into this: https://www.philschmid.de/sagemaker-fsdp-gpt and tried setting `fsdp` up by adding `'fsdp': '"full_shard auto_wrap"'`.
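For reference, here is a minimal sketch of how that `fsdp` value sits in the hyperparameters dict passed to the SageMaker estimator. Only the `fsdp` entry comes from the posts; the other key names are assumptions for illustration, not verified against the actual training script:

```python
# Sketch of the hyperparameters dict handed to the SageMaker HuggingFace
# estimator. Only `fsdp` is taken from the tutorial; the other keys are
# assumed, illustrative names.
hyperparameters = {
    "model_id": "mistralai/Mixtral-8x7B-v0.1",  # assumed model id
    "max_seq_len": 8192,                        # the ~8k context I need
    # Nested quotes matter: SageMaker passes hyperparameters through a
    # shell, so the inner double quotes keep both FSDP options together
    # as a single argument to the training script.
    "fsdp": '"full_shard auto_wrap"',
}

# After the outer quoting layer is stripped, the script should see two
# FSDP options in one string:
fsdp_value = hyperparameters["fsdp"].strip('"')
print(fsdp_value.split())  # -> ['full_shard', 'auto_wrap']
```

If the quoting is wrong (e.g. a single layer of quotes), the two options can arrive as separate, broken arguments, which silently disables sharding.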
According to this chart, https://github.com/hiyouga/LLaMA-Factory#hardware-requirement, the estimated memory requirement should be around 30-60 GB for the model? How much does the context length affect memory usage?
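For what it's worth, a rough back-of-envelope calculation (my own assumptions, not measurements) suggests why going to the 48xlarge doesn't help without sharding: its 192 GB is split across 8 A10G GPUs at 24 GB each, and without FSDP every GPU holds the full quantized model:

```python
# Back-of-envelope memory estimate for QLoRA on Mixtral 8x7B.
# All numbers here are rough assumptions, not measurements.
params = 46.7e9                   # ~46.7B total parameters in Mixtral 8x7B

weights_gb = params * 0.5 / 1e9   # 4-bit NF4 quantization ~ 0.5 bytes/param
per_gpu_gb = 24                   # one A10G on a g5 instance

print(f"4-bit base weights: {weights_gb:.2f} GB")

# Without sharding (plain DDP), EVERY GPU holds the full quantized model,
# so nearly all of each 24 GB A10G is consumed before any activations,
# which themselves grow with sequence length (hence the OOM at 8k tokens).
headroom_gb = per_gpu_gb - weights_gb
print(f"Per-GPU headroom before activations: {headroom_gb:.2f} GB")
```

That would mean the per-GPU limit, not the total 192 GB, is the binding constraint, and FSDP (which shards the weights across GPUs) is what actually buys headroom for longer contexts.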
I also saw that here https://github.com/philschmid/sagemaker-huggingface-llama-2-samples/blob/master/training/sagemaker-notebook.ipynb you are using a much larger instance, but I'm not sure whether that's just because the example is older.
PS: I am using pretty much the same parameters as in the post, except for `max_seq_len`, and with the addition of `fsdp`.
Any insight would be greatly appreciated.