Closed: DuoduoLi closed this issue 1 year ago.
Hi!
Could you provide the exact command you use to launch the training, please? Thanks!
Hi @ydshieh, the command is:

```
deepspeed --master_port 39500 --num_gpus=8 ./code/run_summarization.py \
  --model_name_or_path ./model --do_train --do_eval \
  --train_file ./data/train.json --validation_file ./data/valid.json \
  --output_dir ./output/ --overwrite_output_dir \
  --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
  --gradient_accumulation_steps=8 \
  --max_source_length 4000 --max_target_length 2000 --max_eval_samples 4000 \
  --num_beams 4 \
  --evaluation_strategy=steps --metric_for_best_model=eval_loss --load_best_model_at_end=True \
  --warmup_steps=1250 --eval_steps 1250 --logging_steps 1250 --save_steps 1250 \
  --num_train_epochs 1 --save_total_limit=10 \
  --ignore_pad_token_for_loss --learning_rate 3e-4 --pad_to_max_length \
  --source_prefix summarize: \
  --deepspeed ./code/configs/dsconfig_zero2.json
```
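(Quick cross-check, not part of the original report: the effective global batch size implied by these flags can be computed directly; the world size below is assumed from `--num_gpus=8`.)

```python
# Cross-check of the effective global batch size implied by the launch command.
# world_size is assumed from --num_gpus=8; the other values come from the CLI flags.
per_device_train_batch_size = 1   # --per_device_train_batch_size=1
gradient_accumulation_steps = 8   # --gradient_accumulation_steps=8
world_size = 8                    # deepspeed --num_gpus=8

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * world_size
print(global_batch_size)  # 64, matching "train_batch_size": 64 in the DeepSpeed config below
```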
Hi @DuoduoLi
Sorry for being late here. Could you try transformers with the `main` branch and see if this still persists?
(`python -m pip install --no-cache-dir git+https://github.com/huggingface/transformers@main#egg=transformers`)
(@pacman100 Do you know about this issue and whether it is already fixed on `main`?)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Environment info
transformers version: 4.31.0
Using distributed or parallel set-up in script?: DeepSpeed
DeepSpeed config:

```
{
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.0003,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.0
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.0003,
            "warmup_num_steps": 1.200000e+03
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2.000000e+08,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2.000000e+08,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "bf16": {
        "enabled": false
    }
}
```
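(Added note, not from the thread: the Hugging Face Trainer/DeepSpeed integration documents an "auto" mechanism that lets the Trainer fill in batch sizes, accumulation steps, and learning-rate settings from its own arguments, so the JSON and the command line cannot drift apart. A sketch of the same ZeRO-2 + CPU-offload setup written that way; the file name ds_config_auto.json is arbitrary.)

```python
# Sketch only: same ZeRO-2 + CPU-offload config, but with "auto" placeholders
# that the HF Trainer resolves from TrainingArguments (batch size, grad accum, lr).
import json

ds_config = {
    "fp16": {"enabled": "auto"},
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto"},
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "allgather_partitions": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_auto.json", "w") as f:  # arbitrary file name
    json.dump(ds_config, f, indent=2)
```

Such a file could then be passed via --deepspeed ds_config_auto.json in the launch command above.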
Who can help?
@pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Model I am using (Bert, XLNet ...): mt5
The problem arises when using:
```
Traceback (most recent call last):
  File "./code/run_summarization.py", line 902, in <module>
    main()
  File "./code/run_summarization.py", line 801, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/tmp/env/lib/python3.8/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/tmp/env/lib/python3.8/site-packages/accelerate/accelerator.py", line 1917, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1890, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1953, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/tmp/env/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/tmp/env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 871, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1332, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 899, in reduce_independent_p_g_buckets_and_remove_grads
    self.reduce_ipg_grads()
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1319, in reduce_ipg_grads
    self.copy_grads_in_partition(param)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1239, in copy_grads_in_partition
    self.async_accumulate_grad_in_cpu_via_gpu(param)
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1143, in async_accumulate_grad_in_cpu_via_gpu
    accumulate_gradients()
  File "/tmp/env/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1122, in accumulate_gradients
    param.grad_accum.data.view(-1).add_(dest_buffer)
AttributeError: 'NoneType' object has no attribute 'data'
```
My own modified scripts: (give details below)
The task I am working on is: summarization
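(Reading of the traceback above, added for context rather than taken from the thread: the crash happens while ZeRO stage 2 with CPU optimizer offload adds a freshly reduced gradient into a per-parameter accumulation buffer, param.grad_accum, during backward; the AttributeError means that buffer is None for some parameter once gradient accumulation is enabled. The sketch below only mimics the failing expression; the class and names are stand-ins, not DeepSpeed code.)

```python
# Illustrative stand-in for the failing expression in stage_1_and_2.py (line 1122
# in the traceback above); FakeParam is NOT DeepSpeed code, it only shows why a
# None accumulation buffer produces this exact AttributeError.
import torch

class FakeParam:
    def __init__(self, grad_accum):
        self.grad_accum = grad_accum  # per-parameter accumulation buffer, or None

dest_buffer = torch.zeros(4)

ok = FakeParam(grad_accum=torch.zeros(4))
ok.grad_accum.data.view(-1).add_(dest_buffer)          # fine: the buffer exists

broken = FakeParam(grad_accum=None)
try:
    broken.grad_accum.data.view(-1).add_(dest_buffer)  # same shape of expression as in the traceback
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'data'
```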
Expected behavior
When training with gradient_accumulation_steps > 1 under DeepSpeed, grad_accum is None and training crashes with the traceback above. If gradient_accumulation_steps is not set, training runs fine.
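(Generic helper, not from the thread: when re-testing against the main branch as suggested above, it helps to record exactly which library versions are in the environment, since this kind of failure tends to depend on the transformers/accelerate/deepspeed combination. A minimal sketch using only the standard library:)

```python
# Record the library versions in the current environment; useful when checking
# whether the main-branch suggestion above changes the behaviour.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "accelerate", "deepspeed", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```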