huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

incorrect gradient accumulation with deepspeed #1970

Closed · vwxyzjn closed 1 year ago

vwxyzjn commented 1 year ago

System Info

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.22.0
- Platform: Linux-5.15.0-1023-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.24.3
- PyTorch version (GPU?): 2.0.1+cu117 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 123.22 GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Reproduction

import torch
import copy
from accelerate import Accelerator
from accelerate.utils import set_seed
from torch.utils.data import TensorDataset, DataLoader

# seed
set_seed(0)

# define toy inputs and labels
x = torch.tensor([1., 2., 3., 4., 5., 6., 7., 8.])
y = torch.tensor([2., 4., 6., 8., 10., 12., 14., 16.])
gradient_accumulation_steps = 4
batch_size = len(x) // gradient_accumulation_steps  # 8 samples / 4 steps = micro-batch size of 2

# define dataset and dataloader
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, batch_size=batch_size)

# define model, optimizer and loss function
model = torch.nn.Linear(1, 1)
torch.nn.init.zeros_(model.weight)  # Initialize the weight with zeros
model_clone = copy.deepcopy(model)
criterion = torch.nn.MSELoss()
model_optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
model, model_optimizer, dataloader = accelerator.prepare(model, model_optimizer, dataloader)
model_clone_optimizer = torch.optim.SGD(model_clone.parameters(), lr=0.02)

print(f"initial model weight is {accelerator.unwrap_model(model).weight.item():.5f}")
print(f"initial model weight is {model_clone.weight.item():.5f}")

for i, (inputs, labels) in enumerate(dataloader):
    with accelerator.accumulate(model):
        # inside accumulate(), accelerator.backward() scales the loss by
        # 1/gradient_accumulation_steps and the prepared optimizer skips
        # step()/zero_grad() until the last micro-batch; under DeepSpeed,
        # the engine applies its own gradient_accumulation_steps setting
        inputs = inputs.view(-1, 1)
        print(i, inputs.flatten())
        labels = labels.view(-1, 1)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        accelerator.backward(loss)
        model_optimizer.step()
        model_optimizer.zero_grad()

# reference: one full-batch update without accumulation, which the
# accumulated run above should reproduce exactly
loss = criterion(model_clone(x.view(-1, 1)), y.view(-1, 1))
model_clone_optimizer.zero_grad()
loss.backward()
model_clone_optimizer.step()

print(f"w/ accumulation, the final model weight is {accelerator.unwrap_model(model).weight.item():.5f}")
print(f"w/o accumulation, the final model weight is {model_clone.weight.item():.5f}")

Things work as expected without DeepSpeed:

$ poetry run accelerate launch --num_processes 1  grad_accu.py 
[2023-09-13 23:01:01,987] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-13 23:01:10,720] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.94344
w/o accumulation, the final model weight is 1.94344

Things do not work as expected with DeepSpeed:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
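
Note that DeepSpeed derives train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size, so with gradient_accumulation_steps: 1 in this yaml the engine reports train_batch_size = 2 × 1 × 1 = 2 in the log below, rather than the full batch of 8 that the training loop intends.
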
$ poetry run accelerate launch  --config_file deepspeed.yaml grad_accu.py 
[2023-09-13 22:57:28,874] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-13 22:57:47,448] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-13 22:57:51,957] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-13 22:57:51,957] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-13 22:57:52,024] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.2, git-hash=unknown, git-branch=unknown
[2023-09-13 22:57:55,684] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-09-13 22:57:55,684] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-09-13 22:57:55,684] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-09-13 22:57:55,684] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = SGD
[2023-09-13 22:57:55,684] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=SGD type=<class 'torch.optim.sgd.SGD'>
[2023-09-13 22:57:55,684] [WARNING] [engine.py:1149:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2023-09-13 22:57:55,684] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 1 optimizer
[2023-09-13 22:57:55,685] [INFO] [stage_1_and_2.py:146:__init__] Reduce bucket size 500,000,000
[2023-09-13 22:57:55,685] [INFO] [stage_1_and_2.py:147:__init__] Allgather bucket size 500,000,000
[2023-09-13 22:57:55,685] [INFO] [stage_1_and_2.py:148:__init__] CPU Offload: False
[2023-09-13 22:57:55,685] [INFO] [stage_1_and_2.py:149:__init__] Round robin gradient partitioning: False
Rank: 0 partition count [1] and sizes[(2, False)] 
[2023-09-13 22:57:55,913] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states
[2023-09-13 22:57:55,914] [INFO] [utils.py:804:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2023-09-13 22:57:55,914] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 84.74 GB, percent = 7.6%
[2023-09-13 22:57:56,005] [INFO] [utils.py:803:see_memory_usage] After initializing optimizer states
[2023-09-13 22:57:56,006] [INFO] [utils.py:804:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2023-09-13 22:57:56,006] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 84.74 GB, percent = 7.6%
[2023-09-13 22:57:56,006] [INFO] [stage_1_and_2.py:520:__init__] optimizer state initialized
[2023-09-13 22:57:56,088] [INFO] [utils.py:803:see_memory_usage] After initializing ZeRO optimizer
[2023-09-13 22:57:56,088] [INFO] [utils.py:804:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2023-09-13 22:57:56,088] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory:  used = 84.79 GB, percent = 7.6%
[2023-09-13 22:57:56,093] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = SGD
[2023-09-13 22:57:56,094] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-09-13 22:57:56,094] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-09-13 22:57:56,094] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.02], mom=[0]
[2023-09-13 22:57:56,094] [INFO] [config.py:963:print] DeepSpeedEngine configuration:
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   amp_enabled .................. False
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   amp_params ................... False
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   bfloat16_enabled ............. False
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   checkpoint_parallel_write_pipeline  False
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   checkpoint_tag_validation_enabled  True
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   checkpoint_tag_validation_fail  False
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7efff803ff40>
[2023-09-13 22:57:56,094] [INFO] [config.py:967:print]   communication_data_type ...... None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   curriculum_enabled_legacy .... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   curriculum_params_legacy ..... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   data_efficiency_enabled ...... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   dataloader_drop_last ......... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   disable_allgather ............ False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   dump_state ................... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   dynamic_loss_scale_args ...... None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_enabled ........... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_gas_boundary_resolution  1
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_layer_num ......... 0
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_max_iter .......... 100
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_stability ......... 1e-06
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_tol ............... 0.01
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   eigenvalue_verbose ........... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   elasticity_enabled ........... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   fp16_auto_cast ............... None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   fp16_enabled ................. False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   fp16_master_weights_and_gradients  False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   global_rank .................. 0
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   grad_accum_dtype ............. None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   gradient_accumulation_steps .. 1
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   gradient_clipping ............ 0.0
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   gradient_predivide_factor .... 1.0
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   initial_dynamic_scale ........ 65536
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   load_universal_checkpoint .... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   loss_scale ................... 0
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   memory_breakdown ............. False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   mics_hierarchial_params_gather  False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   mics_shard_size .............. -1
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   optimizer_legacy_fusion ...... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   optimizer_name ............... None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   optimizer_params ............. None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   pld_enabled .................. False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   pld_params ................... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   prescale_gradients ........... False
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   scheduler_name ............... None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   scheduler_params ............. None
[2023-09-13 22:57:56,095] [INFO] [config.py:967:print]   sparse_attention ............. None
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   sparse_gradients_enabled ..... False
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   steps_per_print .............. inf
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   train_batch_size ............. 2
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   train_micro_batch_size_per_gpu  2
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   use_node_local_storage ....... False
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   wall_clock_breakdown ......... False
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   world_size ................... 1
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   zero_allow_untested_optimizer  True
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   zero_enabled ................. True
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   zero_force_ds_cpu_optimizer .. True
[2023-09-13 22:57:56,096] [INFO] [config.py:967:print]   zero_optimization_stage ...... 1
[2023-09-13 22:57:56,096] [INFO] [config.py:953:print_user_config]   json = {
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 1, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": false
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.64573
w/o accumulation, the final model weight is 1.94344

Expected behavior

The gradient accumulation result with DeepSpeed should match the baseline without accumulation:

w/ accumulation, the final model weight is 1.94344

muellerzr commented 1 year ago

@vwxyzjn try installing from main? I think this may have been fixed.

vwxyzjn commented 1 year ago

@muellerzr thanks for the prompt reply! I just tried, and the results still appear incorrect:

$ pip install git+https://github.com/huggingface/accelerate.git
Collecting git+https://github.com/huggingface/accelerate.git
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-v7rf5ncz
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-v7rf5ncz
  Resolved https://github.com/huggingface/accelerate.git to commit 40a73e0ae0dad0f5b9c0cdcc1b49165fcf08caf9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.64573
w/o accumulation, the final model weight is 1.94344

muellerzr commented 1 year ago

cc @pacman100

lewtun commented 1 year ago

Could the discrepancy be tied to the fact that the DeepSpeed plugin reads the number of gradient accumulation steps from its config, and that this overrides the value passed to the Accelerator?

What happens if you change this part of your config as follows:

deepspeed_config:
  gradient_accumulation_steps: 4

pacman100 commented 1 year ago

Hello @vwxyzjn and @lewtun,

the value passed to the Accelerator object is only used if gradient_accumulation_steps in the DeepSpeed config is set to auto. That is only possible when using a DeepSpeed JSON config file with auto for gradient_accumulation_steps. In other cases, please set it correctly when creating the DeepSpeed config via the accelerate config command, as Lewis suggested.
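
For illustration, a minimal DeepSpeed JSON config of that kind might look like the sketch below (field names mirror the user-config dumps above; the accelerate yaml would point at the file via deepspeed_config_file, and the value passed to Accelerator(gradient_accumulation_steps=4) then fills in the auto entries):

{
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2
    }
}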

pacman100 commented 1 year ago

See the tests at https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L610 for clarity on this
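
Alternatively, here is a minimal Python-side sketch that keeps the setting in one place, assuming the DeepSpeedPlugin keyword arguments shown (they mirror the yaml fields):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# configure DeepSpeed in code so a yaml value cannot silently pin
# gradient_accumulation_steps to something the training loop does not expect
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=4)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)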

vwxyzjn commented 1 year ago

Thanks @lewtun and @pacman100, I removed deepspeed_config's gradient_accumulation_steps and everything works as expected again; the config dump below now shows gradient_accumulation_steps: 4 and train_batch_size: 8. Sorry for the oversight in the configuration!

 deepspeed_config:
-  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
[... beginning of the DeepSpeed engine config log truncated in the paste; the zero_config line now reports stage=2 ...]
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print]   zero_enabled ................. True
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print]   zero_force_ds_cpu_optimizer .. True
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print]   zero_optimization_stage ...... 2
[2023-09-14 14:55:39,464] [INFO] [config.py:957:print_user_config]   json = {
    "train_batch_size": 8, 
    "train_micro_batch_size_per_gpu": 2, 
    "gradient_accumulation_steps": 4, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": false
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.94344
w/o accumulation, the final model weight is 1.94344