Open premanand09 opened 10 months ago
Hi, I am going to do distributed fine-tuning of Llama on AWS SageMaker as a managed training job across multiple devices/nodes. SageMaker provides both data-parallel and model-parallel distributed training. Since SageMaker already takes care of the distribution, do I need to keep the current FSDP implementation in the Llama fine-tuning script, or should I remove it?

A recipe for this would be very helpful.

Yes. I need a suggestion on whether keeping the FSDP implementation makes sense for managed distributed training in SageMaker. Kindly help.

+1

Hi! I found this tutorial from SageMaker; hopefully it can solve this issue.
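For anyone landing here: as I understand it, you can keep the script's FSDP implementation and use SageMaker only as the cluster manager and process launcher, rather than switching to SageMaker's model-parallel library. Below is a minimal sketch using the SageMaker Python SDK's `torch_distributed` (torchrun) launcher; the entry point name, instance type, framework versions, and S3 paths are placeholders, not values from this thread.

```python
# Minimal sketch (not an official recipe): launch the existing FSDP
# fine-tuning script as a SageMaker managed multi-node job.
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

estimator = PyTorch(
    entry_point="finetuning.py",      # the existing FSDP script, kept as-is (placeholder name)
    source_dir=".",                   # directory containing the script and its requirements
    role=role,
    instance_count=2,                 # multi-node: SageMaker provisions the cluster
    instance_type="ml.p4d.24xlarge",  # placeholder GPU instance type
    framework_version="2.2",          # placeholder PyTorch version
    py_version="py310",
    # torchrun launcher: starts one process per GPU across all nodes, so the
    # script's own FSDP code sees the usual torch.distributed env vars.
    distribution={"torch_distributed": {"enabled": True}},
    hyperparameters={                 # forwarded to the script as --flags
        "enable_fsdp": True,
        "model_name": "meta-llama/Llama-2-7b-hf",
    },
)

estimator.fit({"training": "s3://my-bucket/llama-finetune-data"})  # placeholder S3 URI
```

With this setup the script's FSDP code stays in place and handles the sharding itself; SageMaker's role is limited to provisioning the nodes and launching the distributed processes, so the SageMaker model-parallel library is not needed.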