hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Help on understanding the implementation of FSDP #5441

Open jq-wei opened 2 months ago

jq-wei commented 2 months ago

I don't understand the implementation of FSDP. As far as I know, to use FSDP the model needs to be wrapped in FullyShardedDataParallel (commonly aliased as FSDP) from torch.distributed.fsdp, but I cannot find either of these keywords used anywhere in this repo.
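For reference, this is the kind of explicit wrapping I expected to find somewhere in the code base (a minimal single-node sketch using the stock PyTorch API; the `Linear` module is just a placeholder for the real model):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Placeholder module standing in for the actual LLM.
model = torch.nn.Linear(4096, 4096).cuda()

# The explicit wrapping step I expected to find somewhere in this repo:
model = FSDP(model)
```

(launched with something like `torchrun --nproc_per_node=2 demo.py`)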

So I suspect FSDP is not implemented natively here, but rather achieved through something like DeepSpeed ZeRO stage 2/3. Am I missing something?
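To make clear what I mean by that: with DeepSpeed the sharding is configured rather than coded, so the FSDP wrapper never appears in user code. A rough ZeRO-3 sketch (field names from the DeepSpeed docs, values purely illustrative):

```python
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # placeholder for the actual LLM

# Illustrative ZeRO-3 config; stage 3 shards parameters, gradients,
# and optimizer state across ranks.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

# deepspeed.initialize applies the sharding internally, so user code
# never touches FullyShardedDataParallel at all.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```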

In fact, some other issues seem related to this question: basically the model is not sharded, and each GPU holds a full copy of it. See, for example, issues #5140 and #4912.
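For what it's worth, here is the kind of quick check I have in mind (a hypothetical helper, to be called on whatever module the trainer ends up holding after preparation):

```python
import torch
import torch.distributed as dist

def report_local_params(model: torch.nn.Module) -> None:
    """Print how many parameter elements this rank actually holds.

    If the model were truly sharded (FSDP or ZeRO-3), each rank should
    hold only a fraction of the full parameter count; with a full
    replica on every GPU, all ranks report the same full count.
    """
    local_numel = sum(p.numel() for p in model.parameters())
    print(f"rank {dist.get_rank()}: {local_numel / 1e9:.2f}B local parameter "
          f"elements, {torch.cuda.memory_allocated() / 2**30:.1f} GiB allocated")
```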

Thank you very much in advance for any hints.