philschmid / llm-sagemaker-sample

Apache License 2.0

Fine Tuning Mixtral 8x7b #16

Open BedirT opened 3 months ago

BedirT commented 3 months ago

Hi there,

Thanks for the scripts and posts! I am interested in fine-tuning Mixtral 8x7b on sagemaker. The task I have requires around 8k token length.

I have tried running training following this tutorial, but using this updated script instead.

The first post uses an ml.g5.24xlarge instance and, funnily enough, sets no sharding or FSDP parameters. When I try running the same setup with an increased context length, I get an OOM. I went up to an ml.g5.48xlarge instance with 192 GB VRAM, but nothing changed.

I also looked into this and tried setting FSDP up by adding 'fsdp': '"full_shard auto_wrap"'
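For anyone puzzled by the nested quotes in that value, here is a minimal sketch of why they are probably needed. It assumes the hyperparameters end up as command-line arguments for the training script, so a value containing a space has to stay quoted to arrive as a single argument. The `max_seq_len` value of 8192 and the shell-style splitting via `shlex` are illustrative assumptions, not the exact mechanism SageMaker uses.

```python
import shlex

# Hyperparameters as described above. 'max_seq_len': 8192 is an assumed value
# matching the ~8k token requirement; the 'fsdp' entry is quoted from the post.
hyperparameters = {
    "max_seq_len": 8192,
    "fsdp": '"full_shard auto_wrap"',  # inner quotes keep the two options together
}

def to_cli_args(params: dict) -> list:
    """Illustrative only: render a dict as '--key value' and split it like a shell would."""
    command = " ".join(f"--{key} {value}" for key, value in params.items())
    return shlex.split(command)

quoted = to_cli_args(hyperparameters)
unquoted = to_cli_args({"max_seq_len": 8192, "fsdp": "full_shard auto_wrap"})

print(quoted)    # the fsdp options arrive as one argument
print(unquoted)  # 'auto_wrap' becomes a stray extra argument
```

Under this assumption, dropping the inner quotes would leave `auto_wrap` as a separate positional argument that the script's argument parser would most likely reject.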

The estimated setup cost, according to this chart, should be around 30-60 GB for the model? How much does the context length affect memory usage?
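As a rough way to reason about that question, here is a back-of-the-envelope sketch. It assumes Mixtral 8x7B has about 46.7B total parameters, a hidden size of 4096, 32 layers, and a 32,000-token vocabulary, that the base weights are stored in 4-bit, and that gradient checkpointing keeps one bf16 hidden state per layer. It ignores LoRA parameters, optimizer state, quantization constants, recomputation peaks, and framework overhead, so treat the numbers as lower bounds rather than a prediction.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def checkpoint_activation_gb(seq_len: int, hidden_size: int, n_layers: int,
                             batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """Hidden states kept per layer when gradient checkpointing is on (bf16 assumed)."""
    return batch_size * seq_len * hidden_size * n_layers * bytes_per_value / 1e9

def logits_memory_gb(seq_len: int, vocab_size: int,
                     batch_size: int = 1, bytes_per_value: int = 4) -> float:
    """One fp32 copy of the output logits; frameworks often hold more than one."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 1e9

# Assumed Mixtral 8x7B figures (not taken from this thread).
N_PARAMS, HIDDEN, LAYERS, VOCAB = 46.7e9, 4096, 32, 32_000

weights = weight_memory_gb(N_PARAMS, bits_per_param=4)
print(f"4-bit weights: {weights:.2f} GB")

for seq_len in (2048, 8192):
    acts = checkpoint_activation_gb(seq_len, HIDDEN, LAYERS)
    logits = logits_memory_gb(seq_len, VOCAB)
    print(f"seq_len={seq_len}: activations {acts:.2f} GB + logits {logits:.2f} GB per sample")
```

In this simplified model, the per-sample terms grow linearly with sequence length, so going from 2k to 8k tokens roughly quadruples them while the weight term stays fixed. Keep in mind that the 192 GB on an ml.g5.48xlarge is split across 8 GPUs with 24 GB each, so the totals only fit if the weights are actually sharded or spread across devices.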

I also saw that here you are using a much larger instance, but I'm not sure whether that's just because that post is older.

PS: I am using pretty much the same parameters you do in the post, except for max_seq_len and the added fsdp setting.

Any insight would be greatly appreciated.

philschmid commented 3 months ago

Are you trying to fine-tune Mixtral with QLoRA and FSDP on a g5.48xlarge?

BedirT commented 3 months ago

Yes, I tried the 24xlarge initially but moved up since I was getting OOM.

BedirT commented 2 months ago

I still couldn't resolve this. Any idea on how to approach it? The HF docs are pretty weak on SageMaker + Model Parallel Distributed Training.

BedirT commented 2 months ago

Also, accelerate still doesn't seem to have a Model Parallel option.