Summary from discussion:
@RobotSail @JamesKunstle @Maxusmusti - to follow up and create issues and link to this epic as the work is being done
@RobotSail - additional notes regarding support
With regard to FSDP, the main risks we still need to overcome are:
LoRA: DeepSpeed was very compatible with running PEFT models, so getting LoRA to work properly under FSDP will require more work on our end (a rough sketch of the approach is included below).
Checkpointing
In our current implementation, we run DeepSpeed with ZeRO stage-2, which allows us to save a model checkpoint by taking its state from any one of the running GPUs, since the model parameters are simply replicated across all GPUs. DeepSpeed implements all ZeRO stages, but we are only using stage-2 at the moment.
ZeRO stages, listed for reference: stage-1 shards optimizer states, stage-2 additionally shards gradients, and stage-3 additionally shards the model parameters themselves.
FSDP, on the other hand, only supports ZeRO stage-3-style full sharding or no sharding at all. For this reason, it wouldn't be straightforward to feature-gate DeepSpeed as-is without also providing ZeRO-3 support there as well.
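For reference, here's a rough sketch of what saving a checkpoint could look like under FSDP full sharding, where the full state dict has to be gathered onto rank 0 before writing. The model and file path are placeholders (not our actual training code), and the process group is assumed to already be initialized by the launcher:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    StateDictType,
    FullStateDictConfig,
)

# Placeholder model; stands in for whatever module the trainer actually builds.
# Assumes dist.init_process_group() has already been called.
model = torch.nn.Linear(4096, 4096).cuda()

# Under FULL_SHARD (ZeRO-3-style), parameters are sharded across ranks,
# so no single GPU holds a complete copy of the model.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# Gather the full (unsharded) state dict onto rank 0, offloaded to CPU,
# before writing it out.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, save_cfg):
    state = fsdp_model.state_dict()

if dist.get_rank() == 0:
    torch.save(state, "checkpoint.pt")  # placeholder path
```

That extra gather step is exactly what ZeRO stage-2 lets us skip today, since every rank already holds the full parameters.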
We'll also need to make sure this is all tested against the full matrix of devices we intend to support.
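On the LoRA side, here's a minimal sketch of wrapping a PEFT model with FSDP, assuming we keep using PEFT for the adapters. The base model name and target_modules values below are illustrative placeholders:

```python
from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Placeholder model name -- substitute whichever base model we train.
base = AutoModelForCausalLM.from_pretrained("my-org/my-base-model")

# Attach LoRA adapters; only these small matrices end up trainable,
# while the base weights stay frozen.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(base, lora_cfg)

# use_orig_params=True lets FSDP shard modules that mix frozen and
# trainable parameters, which is the main wrinkle with LoRA + FSDP.
fsdp_model = FSDP(peft_model, use_orig_params=True)
```

In practice we'd also want a sensible auto-wrap policy per transformer block, but the mixed frozen/trainable parameter handling is the part that needs validating first.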
The following issues are now part of this epic:
Adding the following general issue as well:
I'll be working on converting / testing this code on Gaudi 2 cards in the multi-GPU case as well.
@Maxusmusti @RobotSail It sounds like this is being solved by @aldopareja's PR that uses Accelerate, and Mustafa's work that enables LoRA checkpointing. What do we need to do to finish this / get it tested?
@JamesKunstle we should sync on this tomorrow, either before or after meetings, to make sure we have everything. Checkpoint resuming and a lot of FSDP testing will definitely be needed this week, and we still need to bring back padding-free training via the HF Transformers Granite model class.
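For context, one way the Accelerate-based approach can be pointed at FSDP is via its FSDP plugin. This is only a sketch; the exact fields and values used in @aldopareja's PR may differ, and some plugin arguments vary across Accelerate versions:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import FullStateDictConfig, ShardingStrategy

# Illustrative settings only -- the real values belong to the PR's config.
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    use_orig_params=True,  # keeps LoRA-style mixed frozen/trainable params workable
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```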
@JamesKunstle - can we close this epic if it's done?
Closing this as discussed in chat. Feel free to reopen if that's incorrect.
Feature Overview
This Feature card is for transitioning our model training infrastructure from DeepSpeed to PyTorch's Fully Sharded Data Parallel (FSDP) to enhance training metrics visibility, broaden accelerator support, and maintain performance parity.
Goals
Requirements
Completion Checklist:
Questions to Answer
Out of Scope
Background
Our current training infrastructure uses DeepSpeed for distributed training. While effective, transitioning to PyTorch FSDP offers strategic advantages in terms of metrics visibility, accelerator support, and potential performance improvements.
User Considerations
Documentation Considerations
Additional notes regarding FSDP -