thank you for the detailed report, @benproton
As you may have derived from the traceback, this has nothing to do with DeepSpeed; the issue is inside wandb, which is a 3rd-party package. You can either remove it:

`pip uninstall wandb`

or, as a better long-term solution, add `--report_to none` to your command line, which will disable wandb (or any other reporting package you happen to have installed in your environment).
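For reference, here's a rough Python-side equivalent of that flag (just a sketch; adding `--report_to none` to your launch command is all you actually need):

```python
# Sketch only: --report_to none maps onto TrainingArguments.report_to,
# and wandb also honours the WANDB_DISABLED environment variable.
import os

os.environ["WANDB_DISABLED"] = "true"  # alternative way to silence wandb

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bennyD",
    report_to="none",  # equivalent to passing --report_to none on the command line
)
```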
Please try again and let me know if it fixes the problem.
Hey! Thanks so much for the quick reply.
Hmm, it still exits at the point of saving the checkpoint, just not with the error I mentioned:
{'loss': 6.8437, 'learning_rate': 5e-05, 'epoch': 0.01}
0%|β | 500/224238 [51:15<381:53:08, 6.14s/it][INFO|trainer.py:2753] 2023-02-06 16:37:12,461 >> Saving model checkpoint to bennyD/checkpoint-500
[INFO|configuration_utils.py:453] 2023-02-06 16:37:12,462 >> Configuration saved in bennyD/checkpoint-500/config.json
[INFO|configuration_utils.py:359] 2023-02-06 16:37:12,464 >> Configuration saved in bennyD/checkpoint-500/generation_config.json
[INFO|modeling_utils.py:1720] 2023-02-06 16:37:12,809 >> Model weights saved in bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 16:37:18,583] [INFO] [engine.py:3500:save_16bit_model] Saving model weights to bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 16:37:18,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/pytorch_model.bin...
[2023-02-06 16:37:31,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/pytorch_model.bin.
[2023-02-06 16:37:31,187] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-02-06 16:37:31,225] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 16:37:31,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 16:37:31,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 16:37:31,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 16:37:38,871] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 442827
[2023-02-06 16:37:38,875] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 442828
[2023-02-06 16:37:45,767] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'examples/pytorch/translation/run-text-gen.py', '--local_rank=1', '--deepspeed', 'tests/deepspeed/ds_config_zero3.json', '--model_name_or_path', 'EleutherAI/gpt-neo-1.3B', '--output_dir=bennyD', '--evaluation_strategy', 'epoch', '--num_train_epochs', '3', '--dataset_name', 'wikitext', '--dataset_config', 'wikitext-2-raw-v1', '--report_to', 'none'] exits with return code = -9
This is with the following command: deepspeed examples/pytorch/translation/run-text-gen.py --deepspeed tests/deepspeed/ds_config_zero3.json --model_name_or_path EleutherAI/gpt-neo-1.3B --output_dir=bennyD --evaluation_strategy epoch --num_train_epochs 3 --dataset_name wikitext --dataset_config "wikitext-2-raw-v1" --report_to none
I don't see any traceback there.
This often happens when you run out of cpu memory.
As it happens during checkpoint saving, does the problem go away if you flip `"stage3_gather_16bit_weights_on_model_save": true` to `false` in your DeepSpeed config?
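For clarity, the relevant fragment of the ZeRO-3 config with the flag flipped would look roughly like this (shown as a Python dict purely for illustration; in your case you'd edit tests/deepspeed/ds_config_zero3.json and leave the other entries untouched):

```python
# Illustrative fragment only; mirror this one change in your JSON config file.
zero3_config_fragment = {
    "zero_optimization": {
        "stage": 3,
        # When true, DeepSpeed gathers the full 16-bit model on CPU at save time,
        # which can exhaust host memory; false skips the gather and relies on the
        # sharded ZeRO checkpoint files instead.
        "stage3_gather_16bit_weights_on_model_save": False,
    }
}
```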
Dude! That worked, thanks so much, would never have got that. Logs:
0%|β | 500/224238 [53:12<396:24:51, 6.38s/it][WARNING|trainer.py:2707] 2023-02-06 18:39:45,438 >> deepspeed.save_16bit_model didn't save the model, since stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights
[INFO|trainer.py:2753] 2023-02-06 18:39:45,439 >> Saving model checkpoint to bennyD/checkpoint-500
[INFO|configuration_utils.py:453] 2023-02-06 18:39:45,440 >> Configuration saved in bennyD/checkpoint-500/config.json
[INFO|configuration_utils.py:359] 2023-02-06 18:39:45,442 >> Configuration saved in bennyD/checkpoint-500/generation_config.json
[INFO|modeling_utils.py:1720] 2023-02-06 18:39:45,795 >> Model weights saved in bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 18:39:45,825] [INFO] [engine.py:3491:save_16bit_model] Did not save the model bennyD/checkpoint-500/pytorch_model.bin because stage3_gather_16bit_weights_on_model_save is False
[WARNING|trainer.py:2707] 2023-02-06 18:39:45,825 >> deepspeed.save_16bit_model didn't save the model, since stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights
[2023-02-06 18:39:45,865] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2023-02-06 18:39:45,873] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 18:39:45,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 18:39:46,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 18:39:46,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 18:40:37,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-06 18:40:37,560] [INFO] [engine.py:3397:_save_zero_checkpoint] zero checkpoint saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-06 18:40:37,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step500 is ready now!
[2023-02-06 18:40:37,656] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
[2023-02-06 18:40:37,679] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 18:40:37,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 18:40:38,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 18:40:38,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 18:41:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-06 18:41:19,341] [INFO] [engine.py:3397:_save_zero_checkpoint] zero checkpoint saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-06 18:41:19,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step500 is ready now!
0%|β | 512/224238
So what does that do and what is the impact of setting it to false? Thanks again
Excellent. It's because it tries to gather the full model on CPU at save time, and you don't have enough CPU memory to do that. But you don't need to gather the model on CPU.
You can read about the cost of using `stage3_gather_16bit_weights_on_model_save`, and more importantly about what you need to know if you're not using it, here:
https://huggingface.co/docs/transformers/main/main_classes/deepspeed#getting-the-model-weights-out
In particular, please make sure to read all the way through to and including "Offline FP32 Weights Recovery", which you will have to use once you have finished training.
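As a rough sketch of that offline recovery step (assuming DeepSpeed's `zero_to_fp32` helpers, which back the `zero_to_fp32.py` script placed inside each checkpoint folder; the paths below are the ones from your logs):

```python
# Hedged sketch: consolidate the sharded ZeRO-3 checkpoint into a single fp32
# state dict once training has finished. This needs enough CPU memory for the
# full model, but it runs offline, outside the training job.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("bennyD/checkpoint-500")
torch.save(state_dict, "bennyD/checkpoint-500/pytorch_model_fp32.bin")
```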
You may close the Issue if you're satisfied, @benproton
If you run into new issues please always open a new Issue. Thank you.
Ok thanks. Is that because I'm offloading to cpu? If I choose not to do that, will that prevent the issue?
Indeed, the offloading takes a lot of CPU memory.
Last question then I'll close. Can we therefore assume that the reason I was able to run https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py with checkpoints successfully - without any errors - is because that script isn't as intensive on the cpu? Thanks
It's hard to tell, as they are different programs; it's possible that one program simply uses more memory than the other.
It's very easy to check, though: just add `--skip_memory_metrics 0`, run a few steps, and it'll print full stats on memory usage so you can compare the two programs. Don't use this in production, since it adds overhead.
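(A minimal sketch of what that flag toggles on the Python side, in case it helps; `skip_memory_metrics` defaults to `True`:)

```python
# skip_memory_metrics=False makes the Trainer record and report CPU/GPU memory
# usage for each stage (init/train/eval), at a small runtime overhead.
from transformers import TrainingArguments

args = TrainingArguments(output_dir="bennyD", skip_memory_metrics=False)
```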
In general, if you were able to start training you should be able to continue training without CPU memory OOM events. This is the one exception: due to `zero.Init`, when the model gets initialized it is loaded directly onto the GPU, so your CPU memory can actually be quite small (smaller than the GPU's) and it'll still work. However, if a user chooses to save the full model, it has to be consolidated on CPU first, and that's where there might not be enough memory. That setting is `True` by default to make it easy for users to start right out of the box; as they learn the ropes they will discover more efficient ways of doing things.
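To make the `zero.Init` point concrete, here's a minimal sketch of what happens under the hood (the Trainer/`from_pretrained` integration does this for you automatically when ZeRO-3 is enabled; the toy model below is just a placeholder):

```python
# Illustration only: under zero.Init, parameters are partitioned across GPUs as
# each layer is constructed, so the full model never has to sit in CPU memory.
import torch
import deepspeed

with deepspeed.zero.Init(config_dict_or_path="tests/deepspeed/ds_config_zero3.json"):
    # placeholder model; in the Trainer workflow this is from_pretrained(...)
    model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(48)])
```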
Also unrelated to your questions: If you have plenty of free gpu memory you may want to consider turning offloading off for one or both config entries and even switch to zero stage 2. Each of these will use more gpu memory but will make your training faster. Measure the different options and see which one gives you the fastest training. Again all the stats are printed at the end of each training.
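For reference, the two offload entries in question typically look like this (shown as a Python dict purely for illustration; your ds_config_zero3.json holds the JSON equivalents):

```python
# Removing either offload block keeps that state on the GPU, trading memory for speed.
zero_offload_fragment = {
    "zero_optimization": {
        "stage": 3,  # could become 2 if the unsharded weights fit in GPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}
```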
That's all incredibly helpful, thanks so much. I think the main culprit was wandb; disabling that stopped the errors. I just tried turning off CPU offloading altogether and training is now running much faster, as you anticipated, and checkpoint saving is still working. I have a good amount of GPU memory across 2 x GPUs (48GB total), and I've been attempting to run larger models across multiple GPUs, as the previous code I was using was hindered by relying on the capabilities of a single GPU. So from what I've learned from the docs, zero stage 3 for sure seems the way to go for this, correct? The goal was to prove I can achieve this before investing in more GPUs, so mission accomplished! Again, thanks so much for all of your help.
You're welcome, @benproton. I'm glad your goal has been reached without spending additional $$.
And zero stage 2 is even faster than stage 3 if you have enough gpu memory to not need to shard model weights.
Also, enabling `--gradient_checkpointing 1` will use less GPU memory at the cost of a ~20% slowdown, but it can enable a larger batch size or a switch to stage 2, so the overall training may end up faster.
Spend some time experimenting with different knobs and you should be able to get an even faster training.
Typically the optimal approach would be along these steps (see the sketch below for how the flags fit together):

- enable `--gradient_checkpointing 1`
- if OOM, then set `offload_param` to `cpu`
- if OOM, then set `offload_optimizer` to `cpu`
- if OOM during `generate`, use a smaller beam search, etc.

Or, alternatively, always start with bs=1 and progress from there. Remember you have `--gradient_accumulation_steps=XXX` together with `--per_device_train_batch_size` to get whatever effective batch size you need regardless of your GPU size.
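Here's a hedged sketch of how those knobs fit together on the Trainer side (the fields below are the TrainingArguments equivalents of the command-line flags; the numbers are placeholders to tune):

```python
# Illustration only: the flags discussed above map onto TrainingArguments fields.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bennyD",
    deepspeed="tests/deepspeed/ds_config_zero3.json",
    gradient_checkpointing=True,     # --gradient_checkpointing 1: ~20% slower, far less activation memory
    per_device_train_batch_size=1,   # start small if memory is tight...
    gradient_accumulation_steps=16,  # ...and recover the effective batch size here (placeholder value)
)
```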
All super helpful pointers thanks again
@stas00 I've been experimenting and everything is working great when using a hugging face dataset such as the example I gave. However, whenever I try using the bittensor dataset the program always just hangs early on, either while training or while evaluating with nothing obvious appearing in the logs. Any ideas? Is there anything I can do to determine what is causing the hanging? Thanks.
E.g.:
Time to load utils op: 0.00036215782165527344 seconds
[INFO|trainer.py:1516] 2023-02-09 22:55:56,474 >> Running training
[INFO|trainer.py:1517] 2023-02-09 22:55:56,474 >> Num examples = 39291
[INFO|trainer.py:1518] 2023-02-09 22:55:56,474 >> Num Epochs = 4
[INFO|trainer.py:1519] 2023-02-09 22:55:56,474 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1520] 2023-02-09 22:55:56,474 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1521] 2023-02-09 22:55:56,474 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1522] 2023-02-09 22:55:56,474 >> Total optimization steps = 9824
[INFO|integrations.py:579] 2023-02-09 22:55:56,994 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
0%| | 0/9824 [00:00<?, ?it/s][2023-02-09 22:56:02,149] [WARNING] [stage3.py:1939:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
1%|β | 50/9824 [01:54<6:00:31, 2.21s/it][INFO|trainer.py:2753] 2023-02-09 22:57:52,401 >> Running Evaluation
[INFO|trainer.py:2755] 2023-02-09 22:57:52,401 >> Num examples = 1034
[INFO|trainer.py:2758] 2023-02-09 22:57:52,401 >> Batch size = 8
49%|ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 32/65 [00:31<00:32, 1.01it/s]
yes, and I will reply once you open a new Issue and fully document the Issue.
I will give you a quick pointer: https://github.com/stas00/toolbox/blob/master/pytorch/torch-distributed-hanging-solutions.md but we won't continue this discussion in this Issue.
This issue has been resolved and closed for good. New problems require new Issues.
thank you.
Done, thank you @stas00
System Info

transformers version: 4.27.0.dev0

Who can help?

@stas00, @pacman100

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I've been trying to use the Trainer with deepspeed using the following guide: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/deepspeed#trainer-deepspeed-integration
Below is my python code:
And the command I'm using is:
deepspeed examples/pytorch/translation/run-text-gen.py --deepspeed tests/deepspeed/ds_config_zero3.json --model_name_or_path EleutherAI/gpt-neo-1.3B --output_dir=bennyD --evaluation_strategy epoch --num_train_epochs 2 --dataset_name wikitext --dataset_config "wikitext-2-raw-v1"
The full stack trace:
It's worth noting that if I run the following code: https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py used in the guide, and modify it to make checkpoints, I do not get the same error.
Additionally, if I add `--save_strategy no` to my command, it completes with no errors. But I need the checkpoints. Please help, I've been trying to figure this one out for a while.
Expected behavior
The command runs with checkpoints and completes without errors.