@vwxyzjn try installing via main? I think this may have been fixed
@muellerzr thanks for the prompt reply! I just tried, and the results still appear incorrect:
pip install git+https://github.com/huggingface/accelerate.git
Collecting git+https://github.com/huggingface/accelerate.git
Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-v7rf5ncz
Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-v7rf5ncz
Resolved https://github.com/huggingface/accelerate.git to commit 40a73e0ae0dad0f5b9c0cdcc1b49165fcf08caf9
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.64573
w/o accumulation, the final model weight is 1.94344
cc @pacman100
Could the discrepancy be tied to the fact that the DeepSpeed plugin reads the number of gradient accumulation steps from the config, and that this overrides the value passed to the Accelerator?
What happens if you change this part of your config as follows:
deepspeed_config:
  gradient_accumulation_steps: 4
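For reference, a minimal sketch of the mismatch being described (not code from the thread; the state/plugin attribute names are my reading of the accelerate source and may differ across versions): the script asks the Accelerator for 4 accumulation steps, but a concrete value in the config's deepspeed_config section wins.

from accelerate import Accelerator

# The script requests 4 accumulation steps...
accelerator = Accelerator(gradient_accumulation_steps=4)

# ...but if the accelerate config pins deepspeed_config.gradient_accumulation_steps
# to 1, DeepSpeed steps the optimizer every batch and the value above is ignored.
# Inspecting the plugin's resolved config shows which value actually took effect.
ds_plugin = accelerator.state.deepspeed_plugin
if ds_plugin is not None:
    print(ds_plugin.deepspeed_config.get("gradient_accumulation_steps"))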
Hello @vwxyzjn and @lewtun,
the value passed to the Accelerator object is only used if the value of gradient_accumulation_steps in the DeepSpeed config is auto. This is only possible when using a DeepSpeed JSON config file with auto for gradient_accumulation_steps. In all other cases, please specify it correctly when creating the DeepSpeed config via the accelerate config command, as Lewis suggested.
See the tests at https://github.com/huggingface/accelerate/blob/main/tests/deepspeed/test_deepspeed.py#L610 for clarity on this
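To illustrate the auto case, a minimal sketch (assuming the DeepSpeedPlugin(hf_ds_config=...) API; argument names may vary between accelerate versions):

import json
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# A DeepSpeed JSON config that defers gradient_accumulation_steps to the
# value the Accelerator is constructed with.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 2},
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
# Only because the JSON value is "auto" does this 4 take effect; a concrete
# number in the JSON (or in the accelerate config) would win instead.
accelerator = Accelerator(deepspeed_plugin=plugin, gradient_accumulation_steps=4)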
Thanks @lewtun and @pacman100, I removed deepspeed_config's gradient_accumulation_steps and everything is working as expected again. Sorry for the oversight in the configuration!
 deepspeed_config:
-  gradient_accumulation_steps: 1
   offload_optimizer_device: none
   offload_param_device: none
   zero3_init_flag: false
   zero_stage: 2
_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print] zero_enabled ................. True
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print] zero_force_ds_cpu_optimizer .. True
[2023-09-14 14:55:39,464] [INFO] [config.py:971:print] zero_optimization_stage ...... 2
[2023-09-14 14:55:39,464] [INFO] [config.py:957:print_user_config] json = {
"train_batch_size": 8,
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"nvme_path": null
},
"offload_param": {
"device": "none",
"nvme_path": null
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"fp16": {
"enabled": false
},
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.], device='cuda:0')
1 tensor([3., 4.], device='cuda:0')
2 tensor([5., 6.], device='cuda:0')
3 tensor([7., 8.], device='cuda:0')
w/ accumulation, the final model weight is 1.94344
w/o accumulation, the final model weight is 1.94344
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Things work as expected without deepspeed.
Things do not work as expected with deepspeed.
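For context, a minimal sketch of the kind of repro script whose output is shown above (my reconstruction, not the exact script from the issue; the learning rate and loss are arbitrary): a single-weight linear model trained on four fixed batches, run once with gradient_accumulation_steps=4 and once with 1 so the final weights can be compared.

import sys
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Run once per setting, e.g.:
#   accelerate launch repro.py 4   # w/ accumulation
#   accelerate launch repro.py 1   # w/o accumulation
steps = int(sys.argv[1]) if len(sys.argv) > 1 else 4
accelerator = Accelerator(gradient_accumulation_steps=steps)

model = torch.nn.Linear(1, 1, bias=False)
torch.nn.init.zeros_(model.weight)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

# Four fixed batches: [1, 2], [3, 4], [5, 6], [7, 8], as in the printout above.
x = torch.arange(1.0, 9.0).reshape(-1, 1)
dataloader = DataLoader(TensorDataset(x, 2 * x), batch_size=2, shuffle=False)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
accelerator.print(f"initial model weight is {accelerator.unwrap_model(model).weight.item():.5f}")

for i, (xb, yb) in enumerate(dataloader):
    accelerator.print(i, xb.flatten())
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

accelerator.print(f"the final model weight is {accelerator.unwrap_model(model).weight.item():.5f}")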
Expected behavior
The gradient accumulation result of deepspeed should be the same as the result without deepspeed, i.e. the final model weight w/ accumulation should match the one w/o accumulation.