huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Plan to support FSDP2? #2873

Open ByronHsu opened 1 week ago

ByronHsu commented 1 week ago

FSDP2 provides a smaller memory footprint, compatibility with torch.compile, and more flexibility thanks to per-parameter sharding. Does Hugging Face have plans to support FSDP2?

https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
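
For context, here is a minimal sketch of what the FSDP2 entry point looks like, assuming the PyTorch 2.4 prototype import path (`torch.distributed._composable.fsdp`) and a script launched with `torchrun`; the toy model is purely illustrative and not from the thread:

```python
# Minimal FSDP2 sketch; assumes the PyTorch 2.4 prototype import path and that
# the script is launched with torchrun so a default process group can be formed.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Illustrative toy model (not from the thread).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# FSDP2 is applied as a function per module rather than via a wrapper class;
# each parameter is sharded individually as a DTensor (no flat parameter).
fully_shard(model)

for name, param in model.named_parameters():
    print(name, type(param))  # parameters are now DTensor shards

dist.destroy_process_group()
```

Because the parameters stay as individual DTensors rather than a flat buffer, torch.compile can be layered on top of the sharded modules, which is the compatibility point raised above.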

BenjaminBossan commented 1 week ago

Thanks for bringing FSDP2 to our (or at least my) attention. The changes described in the document you linked sound very reasonable and could remove some of the common pain points of using FSDP.

Reading this, I got the impression that this is a very new addition to PyTorch. Searching for fully_shard in the PyTorch docs returns no hits, which reinforces that impression. But looking at the actual code, it's already two years old! So I'm now confused about the state of this feature: is it going to be officially released soon, or is it more of an experimental feature that may or may not see continued work? Do you have any insights on that, @ByronHsu?

ByronHsu commented 1 week ago

Thanks @BenjaminBossan! If I understand correctly, the PyTorch team wants to replace FSDP1 with FSDP2 in the long term. I saw it has already been integrated into torchtitan. Maybe we can make some plans for accelerate too? Otherwise, users cannot use torch.compile with FSDP in HF. cc PyTorch team @awgu @msaroufim

awgu commented 1 week ago

> But looking at the actual code, it's already 2 years old!

Very sorry for the confusion! There are two separate functions called fully_shard: one is 2 years old, and one is new as of this year. For historical context, we were experimenting with approaches to implementing FSDP that were not an nn.Module wrapper like FullyShardedDataParallel. This led to the distributed/_composable folder, where the APIs are all verbs, hence fully_shard. The original fully_shard called into the same underlying code as FullyShardedDataParallel. The new fully_shard (FSDP2) is a standalone implementation.

We proposed FSDP2 as a prototype feature for the 2.4 release, and we are investing in it heavily.
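
To make the naming collision concrete, here is a hedged sketch of the three entry points involved; module paths reflect the PyTorch 2.4 prototype and may move before a stable release:

```python
# FSDP1: the nn.Module wrapper most users know today.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# The ~2-year-old composable API: a verb-style entry point that calls into the
# same underlying machinery as FullyShardedDataParallel.
from torch.distributed._composable import fully_shard as composable_fully_shard

# FSDP2: the new standalone, per-parameter-sharding implementation
# (prototype location as of PyTorch 2.4; subject to change).
from torch.distributed._composable.fsdp import fully_shard as fsdp2_fully_shard
```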

BenjaminBossan commented 1 week ago

Thanks a lot for clearing up my confusion. In that case, I think it makes sense to wait until FSDP2 is released and then run experiments with accelerate to see how it can best be supported.

muellerzr commented 1 day ago

The main worry with FSDP2 is whether it's stable enough to make sense to include in Accelerate. In the worst case, we can keep a draft PR open and/or ship it as an experimental feature (and advertise it as such).

So my main question is how far along in development FSDP2 actually is.

I planned on looking into FSDP2 in the near future anyway, so I'm open to having some early-ish support for it in Accelerate as long as I can get a full grasp of where it stands.

(We did something similar with PiPPy, so we're okay doing so here too.)

I know we need to do some heavy uprooting to add custom process support into Accelerate, which I believe FSDP2 relies on, if I'm not mistaken?
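
On the process-management point: in the 2.4 prototype, fully_shard takes a DeviceMesh rather than a raw process group, so whatever process handling Accelerate adds would surface roughly like the sketch below; the 1-D mesh construction here is illustrative, not prescriptive:

```python
# Sketch of FSDP2's device-mesh interface (PyTorch 2.4 prototype paths); the
# 1-D mesh here stands in for whatever process/mesh setup Accelerate would own.
import os

import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")  # launched via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# A 1-D data-parallel mesh over all ranks; hybrid-sharding setups would build a
# 2-D mesh (replicate dim x shard dim) and pass it the same way.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

model = torch.nn.Linear(1024, 1024, device="cuda")
fully_shard(model, mesh=mesh)  # parameters become DTensors sharded over `mesh`

dist.destroy_process_group()
```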

muellerzr commented 1 day ago

What'd be helpful on my end are some bare-bones FSDP2 examples in PyTorch showing how things operate end to end.
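
In the spirit of that request, here is a bare-bones end-to-end sketch, assuming the PyTorch 2.4 prototype API (`fully_shard`, `MixedPrecisionPolicy`) and a launch like `torchrun --nproc_per_node=2 script.py`; the model, data, and hyperparameters are made up for illustration and are not an official Accelerate example:

```python
# Bare-bones FSDP2 training loop sketch (PyTorch 2.4 prototype API); the model,
# data, and hyperparameters are illustrative only.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard


class ToyBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(self.norm(x))


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    dim = 1024
    model = nn.Sequential(*[ToyBlock(dim) for _ in range(4)], nn.Linear(dim, dim)).cuda()

    # Apply FSDP2 bottom-up: each submodule first, then the root, with bf16
    # mixed precision handled by FSDP2's MixedPrecisionPolicy.
    mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
    for submodule in model:
        fully_shard(submodule, mp_policy=mp)
    fully_shard(model, mp_policy=mp)

    # Parameters are now DTensors; a standard optimizer works on them directly.
    # torch.compile can additionally be applied per block, as torchtitan does.
    optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        x = torch.randn(8, dim, device="cuda")
        loss = model(x).float().pow(2).mean()  # dummy objective for the sketch
        loss.backward()
        optim.step()
        optim.zero_grad()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```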