I don't understand the implementation of FSDP. As far as I know, to use FSDP the model needs to be wrapped in FullyShardedDataParallel (or FSDP) imported from torch.distributed.fsdp, but I cannot find either of these keywords anywhere in this repo.
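For context, this is the kind of PyTorch-native wrapping I was expecting to find somewhere in the code. It is only a minimal sketch; the placeholder model, its dimensions, and the launch command are my own assumptions, not anything taken from this repo:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes this script is launched with:
#   torchrun --nproc_per_node=<num_gpus> this_script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model just for illustration; in practice it would be the repo's own model.
model = nn.Transformer(d_model=512, nhead=8).cuda()

# After wrapping, each rank holds only a shard of the parameters
# instead of a full replica of the model.
sharded_model = FSDP(model)
```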
So I don't think FSDP is implemented natively here, but rather, perhaps, through some combination of ZeRO stage 2/3 from DeepSpeed? Am I missing something?
In fact, some other issues are related to this question: basically the model is not sharded, and each GPU holds one full copy of the model. See, for example, issues 5140 and 4912.
Thank you very much in advance for any hints.