DeepSpeed reduces loss scale until it becomes less efficient

System Info

- `Accelerate` version: 0.26.1
- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.11.5
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.2 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.28 GB
- GPU type: NVIDIA GeForce RTX 3070 Ti (6 of them)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 6
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {
                    'gradient_accumulation_steps': 1,
                    'offload_optimizer_device': 'none',
                    'offload_param_device': 'none',
                    'zero3_init_flag': False,
                    'zero3_save_16bit_model': False,
                    'zero_stage': 3
                }
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
[X] My own task or dataset (give details below)

Reproduction

I am running some fine-tuning via Accelerate+DeepSpeed. DS progressively tries to decrease time per iteration, which it does. This is great... until it reaches a certain point and it starts increasing the time per iteration. I wanted to know if there was a way to stop reducing loss scale after a certain point (such as iteration time going back up) as well as the best way to go about using get_accelerator().empty_cache().

Here is some of my code followed by my accelerator yaml and how I run it.

# 1. Have Transformer's determine the best tokenizer for the given model
# 2. Convert XML to readable dataset. Have the first GPU run it first so multiple GPUs aren't trying to edit the XML at
#    the same time
# 3. Set the max length and padding of each eConsult and how wewant to tokenize the dataset
# 4. Split dataset into training dataset and eval 80/20
# 5. Distribute tokenized datasets across multiple GPUs as to not run out of memory
# 6. Create/return dataloader with the given data for the trainer to use
def get_dataloaders(accelerator: Accelerator, batch_size, model_name, data_location, save_location):
    # 1-4

    # 5
    train_sampler = DistributedSampler(
        tokenized_train_dataset, num_replicas=accelerator.num_processes, rank=accelerator.process_index, shuffle=True
    )

    eval_sampler = DistributedSampler(
        tokenized_eval_dataset, num_replicas=accelerator.num_processes, rank=accelerator.process_index, shuffle=False
    )

    # 6
    train_dataloader = DataLoader(
        tokenized_train_dataset,
        batch_size=batch_size,
        drop_last=True,
        sampler=train_sampler
    )

    eval_dataloader = DataLoader(
        tokenized_eval_dataset,
        batch_size=batch_size*2,
        drop_last=(accelerator.mixed_precision == "fp8"),
        sampler=eval_sampler
    )

    return train_dataloader, eval_dataloader

# 1. Initialize accelerator with mixed percision and define training parameters via arguments given in command line
# 2. Sets seed (if given as a command line argument) for reproducability
# 3. Get dataloaders
# 4. Initialize more training perameters and "prepare"/optimize them via Accelerate
# 5. Train/fine-tune model with new data & set parameters using FSDP
# 6. Evaluate quality of trainer for that epoch
# 7. Have the first GPU save the newly fine-tuned dataset
def training_function(args):
    # 1
    accelerator = Accelerator(mixed_precision=args.mixed_precision,
                              gradient_accumulation_steps=args.gradient_accumulation_steps)

    lr = args.lr
    num_epochs = args.num_epochs
    batch_size = args.batch_size
    num_warmup_steps = args.num_warmup_steps

    # 2
    if args.seed:
        set_seed(args.seed)

    # 3
    train_dataloader, eval_dataloader = get_dataloaders(
        accelerator, batch_size, args.model_name, args.data_location, args.save_location)

    # 4
    # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = AutoModelForCausalLM.from_pretrained(args.model_name)
    # model = accelerator.prepare(model)

    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=(len(train_dataloader) *
                            num_epochs) // args.gradient_accumulation_steps
    )

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # Initialize logging variables
    total_train_loss = 0
    total_eval_loss = 0

    # 5
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        for batch in tqdm(train_dataloader, desc="Training"):
            torch.cuda.empty_cache()
            accelerator.free_memory()
            with accelerator.accumulate(model):
                # Process the batch
                inputs = {k: v.to(accelerator.device)
                          for k, v in batch.items()}
                if "labels" not in inputs:
                    inputs["labels"] = inputs["input_ids"]

                outputs = model(**inputs)
                loss = outputs.loss
                total_train_loss += loss.item()
                accelerator.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

        accelerator.wait_for_everyone()

        # 6
        # Evaluation loop after each training epoch
        model.eval()
        total_eval_loss = 0
        for batch in tqdm(eval_dataloader, "Evaluating"):
            with torch.no_grad():
                inputs = {k: v.to(accelerator.device)
                          for k, v in batch.items()}
                if "labels" not in inputs:
                    inputs["labels"] = inputs["input_ids"]

                outputs = model(**inputs)
                loss = outputs.loss
                total_eval_loss += loss.item()

        # # Log the average losses
        avg_train_loss = total_train_loss / len(train_dataloader)
        avg_eval_loss = total_eval_loss / len(eval_dataloader)
        print(
            f"Epoch: {epoch}, Average Training Loss: {avg_train_loss}, Average Evaluation Loss: {avg_eval_loss}")

        accelerator.wait_for_everyone()

    # 7
    accelerator.wait_for_everyone()
    accelerator.print("saving")
    accelerator.unwrap_model(model).save_pretrained(
        args.save_location,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )

def main():
    args = parse_args()
    training_function(args)

if __name__ == "__main__":
    start = time()
    main()
    print(f"Total Execution Time: {time() - start} seconds")

ds.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: null
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 6
use_cpu: false

and here are the logs

$ NCCL_P2P_DISABLE=1 accelerate launch --config_file accConfigs/ds.yaml finetuneWithAcc.py --batch_size 2 --seed 42 --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --save_location saved_ds --data_location data/GI.xml
Training:   0%|| 0/524 [00:00<?, ?it/s][2024-01-25 13:02:49,465] [WARNING] [parameter_offload.py:87:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.                                                                                                                                                                                                        
Training:   0% 1/524 [34.04s/it] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648                                                                                                                                               
Training:   0% 1/524 [33.90s/it] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824                                                                                                                                               
Training:   0% 2/524 [26.31s/it] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912                                                 
...                                                                                               
Training:   4% 21/524 [19.85s/it] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
Training:   4% 22/524 [20.46s/it] [WARNING] [stage3.py:2008:step] 89 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
...
Training:   5% 28/524 [23.32s/it] [WARNING] [stage3.py:2008:step] 4 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

Expected behavior

Not increase iteration time upon loss scaling

huggingface / accelerate