Open premanand09 opened 10 months ago
Hi, I am going to do distributed fine-tuning of Llama on AWS SageMaker as a managed training job across multiple devices/nodes. SageMaker provides both data-parallel and model-parallel distributed training. Since SageMaker already takes care of the distribution, do I need to keep the current FSDP implementation in the Llama fine-tuning script, or should I remove it?

A recipe for this would be very helpful.

Yes. I need a suggestion on whether keeping the FSDP implementation makes sense for managed distributed training in SageMaker. Kindly help.

+1

Hi! I found this tutorial from SageMaker; hopefully it can solve this issue.
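For anyone landing here: as I understand it, you can keep the script's FSDP implementation and use SageMaker only as the cluster manager and process launcher, rather than switching to SageMaker's model-parallel library. Below is a minimal sketch using the SageMaker Python SDK's `torch_distributed` (torchrun) launcher; the entry point name, instance type, framework versions, and S3 paths are placeholders, not values from this thread.

```python
# Minimal sketch (not an official recipe): launch the existing FSDP
# fine-tuning script as a SageMaker managed multi-node job.
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

estimator = PyTorch(
    entry_point="finetuning.py",      # the existing FSDP script, kept as-is (placeholder name)
    source_dir=".",                   # directory containing the script and its requirements
    role=role,
    instance_count=2,                 # multi-node: SageMaker provisions the cluster
    instance_type="ml.p4d.24xlarge",  # placeholder GPU instance type
    framework_version="2.2",          # placeholder PyTorch version
    py_version="py310",
    # torchrun launcher: starts one process per GPU across all nodes, so the
    # script's own FSDP code sees the usual torch.distributed env vars.
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={                 # forwarded to the script as --flags
        "enable_fsdp": True,
        "model_name": "meta-llama/Llama-2-7b-hf",
    },
)

estimator.fit({"training": "s3://my-bucket/llama-finetune-data"})  # placeholder S3 URI
```

With this setup the script's FSDP code stays in place and handles the sharding itself; SageMaker's role is limited to provisioning the nodes and launching the distributed processes, so the SageMaker model-parallel library is not needed.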