huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

grad_accum is None when using gradient_accumulation_steps in DeepSpeed #25810

Closed DuoduoLi closed 1 year ago

DuoduoLi commented 1 year ago

System Info

- `transformers` version: 4.31.0
- Using distributed or parallel set-up in script?: DeepSpeed
- DeepSpeed config:

```json
{
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 0.0003,
      "betas": [0.9, 0.999],
      "eps": 1e-08,
      "weight_decay": 0.0
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.0003,
      "warmup_num_steps": 1.200000e+03
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2.000000e+08,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2.000000e+08,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "steps_per_print": inf,
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 1,
  "wall_clock_breakdown": false,
  "bf16": {
    "enabled": false
  }
}
```
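As a quick cross-check (not stated in the original report), DeepSpeed requires the three batch-size fields to be consistent with the number of GPUs, and with the 8-GPU launch command further down they are:

```python
# Consistency check implied by the config above and the 8-GPU launch command below:
#   train_batch_size == train_micro_batch_size_per_gpu * world_size * gradient_accumulation_steps
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 8
world_size = 8  # --num_gpus=8
assert train_micro_batch_size_per_gpu * world_size * gradient_accumulation_steps == 64  # train_batch_size
```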

Who can help?


@pacman100

Information

Tasks

Reproduction

Model I am using (Bert, XLNet ...): mt5

The problem arises when using:

- my own modified scripts: (give details below)

```
Traceback (most recent call last):
  File "./code/run_summarization.py", line 902, in <module>
    main()
  File "./code/run_summarization.py", line 801, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/tmp/env/lib/python3.8/site-packages/accelerate/accelerator.py", line 1917, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1953, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/tmp/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 871, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1332, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 899, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1319, in reduce_ipg_grads
    self.copy_grads_in_partition(param)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1239, in copy_grads_in_partition
    self.async_accumulate_grad_in_cpu_via_gpu(param)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1143, in async_accumulate_grad_in_cpu_via_gpu
    accumulate_gradients()
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1122, in accumulate_gradients
    param.grad_accum.data.view(-1).add_(dest_buffer)
AttributeError: 'NoneType' object has no attribute 'data'
```

The task I am working on is: summarization


Expected behavior

When training with gradient_accumulation_steps > 1 set in the DeepSpeed config, param.grad_accum is None and the backward pass crashes with the traceback above. If gradient_accumulation_steps is not set, training runs fine.
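For context, the failing configuration boils down to something like the sketch below. The model name and dataset are placeholders rather than the reporter's actual files, and the script would be launched with `deepspeed` exactly as in the command later in the thread:

```python
# Minimal sketch of the reported setup (hypothetical model name and placeholder dataset,
# not the reporter's actual run_summarization.py script).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments

model_name = "google/mt5-small"  # the report uses a local mT5 checkpoint (./model)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,                   # also set to 8 in the DeepSpeed JSON above
    learning_rate=3e-4,
    num_train_epochs=1,
    deepspeed="./code/configs/dsconfig_zero2.json",  # the ZeRO-2 config shown in the report
)

train_dataset = ...  # placeholder: a tokenized summarization dataset (see run_summarization.py)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()  # reported to fail in backward() with AttributeError: 'NoneType' object has no attribute 'data'
```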

ydshieh commented 1 year ago

Hi !

Could you provide the exact command you use to launch the training, please? Thanks!

DuoduoLi commented 1 year ago

Hi @ydshieh, the command is:

```bash
deepspeed --master_port 39500 --num_gpus=8 ./code/run_summarization.py \
  --model_name_or_path ./model \
  --do_train --do_eval \
  --train_file ./data/train.json \
  --validation_file ./data/valid.json \
  --output_dir ./output/ \
  --overwrite_output_dir \
  --per_device_train_batch_size=1 \
  --per_device_eval_batch_size=1 \
  --gradient_accumulation_steps=8 \
  --max_source_length 4000 \
  --max_target_length 2000 \
  --max_eval_samples 4000 \
  --num_beams 4 \
  --evaluation_strategy=steps \
  --metric_for_best_model=eval_loss \
  --load_best_model_at_end=True \
  --warmup_steps=1250 \
  --eval_steps 1250 \
  --logging_steps 1250 \
  --save_steps 1250 \
  --num_train_epochs 1 \
  --save_total_limit=10 \
  --ignore_pad_token_for_loss \
  --learning_rate 3e-4 \
  --pad_to_max_length \
  --source_prefix summarize: \
  --deepspeed ./code/configs/dsconfig_zero2.json
```
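One detail worth checking in setups like this (an observation, not something raised in the original thread): gradient_accumulation_steps appears both on the command line and in the DeepSpeed JSON, and the two must agree. The Transformers DeepSpeed integration lets the batch-related fields be set to "auto" so the Trainer fills them in; a rough sketch, using a dict instead of a JSON file (which TrainingArguments also accepts):

```python
# Sketch: let the Trainer fill DeepSpeed's batch-related fields instead of hard-coding them twice.
# "auto" values are resolved by the Transformers DeepSpeed integration from the Trainer arguments.
from transformers import Seq2SeqTrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",     # filled from --gradient_accumulation_steps
    "train_micro_batch_size_per_gpu": "auto",  # filled from --per_device_train_batch_size
    "train_batch_size": "auto",                # derived from the two values above and the world size
    "gradient_clipping": "auto",               # filled from --max_grad_norm
}

args = Seq2SeqTrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    deepspeed=ds_config,                       # a dict works here as well as a JSON file path
)
```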

ydshieh commented 1 year ago

Hi @DuoduoLi

Sorry for being late here. Could you try `transformers` from the main branch and see if this still persists? (`python -m pip install --no-cache-dir git+https://github.com/huggingface/transformers@main#egg=transformers`)

(@pacman100 Do you know about this issue and if this is already fixed in main?)


github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.