Open noforit opened 8 months ago
Given the relatively small size of the TinyLlama model, why does the Strategy use FSDP (Fully Sharded Data Parallel) instead of DDP (Distributed Data Parallel)?
Hi, I also want to know why FSDP is preferred here, since a 1.1B model could fit into an A100-40G with bsz=1.
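For reference, here is a back-of-the-envelope sketch of the memory claim above. It assumes fp32 weights, fp32 gradients, and Adam optimizer states (16 bytes per parameter in total), and it ignores activations and framework overhead. The GPU count of 8 is a hypothetical value, not something stated in this thread.

```python
# Rough per-GPU memory for model state only (activations ignored).
# Assumption: fp32 weights (4 B) + fp32 gradients (4 B) + Adam moments (8 B).
BYTES_PER_PARAM = 4 + 4 + 8
GIB = 2**30


def state_gib_per_gpu(n_params: float, world_size: int, sharded: bool) -> float:
    """DDP replicates the full state on every GPU; FSDP full sharding splits it evenly."""
    total = n_params * BYTES_PER_PARAM
    return (total / world_size if sharded else total) / GIB


N_PARAMS = 1.1e9  # TinyLlama size quoted in the thread
WORLD_SIZE = 8    # hypothetical GPU count, not from the thread

ddp = state_gib_per_gpu(N_PARAMS, WORLD_SIZE, sharded=False)
fsdp = state_gib_per_gpu(N_PARAMS, WORLD_SIZE, sharded=True)
print(f"DDP:  {ddp:.1f} GiB per GPU")
print(f"FSDP: {fsdp:.1f} GiB per GPU")
```

Under these assumptions the replicated (DDP) state is roughly 16 GiB per GPU, which is below 40 GB, so the real question is whether the headroom FSDP frees up for activations and larger batch sizes is worth its extra communication.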