microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[HELP] How to safely switch trainable parameters in ZeRO-3 stage? #5639

Closed Ledzy closed 1 week ago

Ledzy commented 2 weeks ago

Thank you for your great contribution!

Problem I want to solve

I would like to know how to safely switch the trainable parameters when training with ZeRO-3. Consider a network with 2 layers: my objective is to train the first layer for a fixed number of steps, then switch to the second layer and train it for the same number of iterations, then switch back to the first layer, and repeat the procedure.

Encountered issues and observations

A straightforward way is to change the requires_grad attribute of the active layer. However, if I set the 1st layer's requires_grad=True and the 2nd layer's requires_grad=False at the beginning, then only the 1st layer is ever updated, even after the 2nd layer's requires_grad becomes True in the later training phase.
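
In plain PyTorch (without DeepSpeed) this toggling behaves as I expect; here is a minimal sketch of the idea, with a toy model and switching schedule chosen purely for illustration:

import torch
from torch import nn
from torch.optim import AdamW

# Toy model: alternate which layer is trainable every K steps.
model = nn.Sequential(nn.Linear(10, 50), nn.Linear(50, 2))
optimizer = AdamW(model.parameters(), lr=1e-3)

K = 5  # switch the active layer every K steps
for step in range(4 * K):
    active = (step // K) % 2  # 0 -> first layer, 1 -> second layer
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = (i == active)

    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()  # parameters without gradients are simply skipped
    optimizer.zero_grad()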

In particular, the averaged_gradients of DeepSpeedZeroOptimizer_Stage3 is an all-zero vector when training the 2nd layer. I found that the corresponding param.requires_grad is False in this line, even though I have set requires_grad=True on the original model's 2nd layer. I suppose this is related to DeepSpeed's gradient synchronization mechanism: it seems that deepspeed.initialize already decides, based on the requires_grad attribute at initialization time, which parameters need gradients and participate in gradient synchronization, much like PyTorch DDP.
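
For debugging, I print what the engine sees after initialization; a rough diagnostic snippet (assuming model is the DeepSpeedEngine returned by deepspeed.initialize, as in the test script further below):

# Rough diagnostic: inspect the flags the engine sees after deepspeed.initialize.
# In my runs, the layer that was frozen at initialization keeps requires_grad=False
# here, even after I flip requires_grad on the original module later.
if torch.distributed.get_rank() == 0:
    for name, p in model.module.named_parameters():
        print(name, "requires_grad =", p.requires_grad,
              "partitioned numel =", p.ds_tensor.numel())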

I have spent weeks on this issue but still cannot find a clean and feasible approach other than re-running deepspeed.initialize each time, which is too time-consuming and inconvenient when combining with other frameworks such as Hugging Face's Trainer. Could you offer me some guidance on this problem? Switching trainable parameters may be an important feature for memory-efficient optimization of LLMs. Any help would be greatly appreciated!

For your reference, the script I use to test ZeRO-3 is included further below. The badam.BlockOptimizer wraps the original optimizer and automatically switches the trainable parameters every fixed number of iterations; you can install it via pip install badam. Its logic is rather straightforward; see the source code here.
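
Conceptually, its switching logic amounts to something like the following simplified sketch (hypothetical names, not the actual badam implementation):

class SimpleBlockOptimizer:
    """Simplified sketch: only one block of parameters is trainable at a time,
    and the active block changes every `switch_every` optimizer steps."""

    def __init__(self, base_optimizer, named_parameters_list, switch_every=50):
        self.base_optimizer = base_optimizer
        # Here each parameter is its own block; badam groups parameters into layer blocks.
        self.blocks = [[p] for _, p in named_parameters_list]
        self.switch_every = switch_every
        self.step_count = 0
        self._activate(0)

    def _activate(self, block_idx):
        # Freeze every block except the active one.
        for i, block in enumerate(self.blocks):
            for p in block:
                p.requires_grad = (i == block_idx)

    def step(self, *args, **kwargs):
        self.base_optimizer.step(*args, **kwargs)
        self.step_count += 1
        if self.step_count % self.switch_every == 0:
            self._activate((self.step_count // self.switch_every) % len(self.blocks))

    def zero_grad(self, *args, **kwargs):
        self.base_optimizer.zero_grad(*args, **kwargs)

And here is the full test script I use: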

import torch
from torch import nn
import deepspeed
from torch.utils.data import Dataset, DataLoader
from badam import BlockOptimizer
from torch.optim import AdamW

USE_DS=True

torch.manual_seed(123)
loss_fn = nn.BCELoss()

class SimpleDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Define a simple two-layer neural network
class TwoLayerNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(TwoLayerNet, self).__init__()
        self.layers = nn.ModuleList([nn.Linear(input_size, hidden_size), nn.Linear(hidden_size, output_size)])
        self.device = torch.device('cuda')

    def forward(self, x):
        x = torch.relu(self.layers[0](x))
        x = torch.sigmoid(self.layers[1](x))
        return x

model = TwoLayerNet(input_size=10, hidden_size=50, output_size=2).to('cuda')

# Make the optimizer change the updated layer periodically
optimizer = BlockOptimizer(base_optimizer=AdamW(model.parameters(), lr=1e-3), 
                           named_parameters_list=list(model.named_parameters()),
                           verbose=2, switch_mode="descending") # descending means update the last layer first

# Initialize DeepSpeed
if USE_DS:
    model, optimizer, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params="ds_config_badam.json")

# Create synthetic data for experiment
data_num = 300
data = torch.randn(data_num, 10, dtype=next(model.parameters()).dtype)  # 300 samples, 10 features each
labels = torch.rand(data_num, 2, dtype=next(model.parameters()).dtype)
dataset = SimpleDataset(data, labels)
data_loader = DataLoader(dataset, batch_size=32)

for epoch in range(10): 
    for batch in data_loader:
        inputs, targets = batch
        inputs = inputs.to(model.device)
        targets = targets.to(model.device)

        # When using ZeRO-3, layer 1 is not updated and its norm remains the same.
        if USE_DS and torch.distributed.get_rank() == 0:
            print(f"layer 0: {torch.norm(model.layers[0].weight.ds_tensor):.7f}, layer 1: {torch.norm(model.layers[1].weight.ds_tensor):.7f}")

        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Backward pass and optimization
        if USE_DS:
            model.backward(loss)
            model.step()
        else:
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
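
I launch the script with the DeepSpeed launcher, for example (the script filename here is just a placeholder):

deepspeed --num_gpus 2 test_badam_zero3.py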

Here is the configuration file "ds_config_badam.json" that I used:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 16,
    "steps_per_print": 2000,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": false
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "number_checkpoints": 4,
        "synchronize_checkpoint_boundary": false,
        "contiguous_memory_optimization": false
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": 1,
    "wall_clock_breakdown": false,
    "local_rank": 0,
    "deepseed_config": {
        "zero_optimization_stage": 3
    }
}

Please let me know if you need any additional information. Thank you so much for your time and effort! cc @tjruwase

Ledzy commented 2 weeks ago

I managed to solve this issue by

after switching the trainable parameters each time. This solution is more time-efficient than rerunning deepspeed.initialize each time. Interested readers may refer to this repo, where I implement the main logic in _switch_trainable_params_zero3.

loadams commented 1 week ago

Thanks for adding your solution @Ledzy