huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Divide by zero errors with OPT #406

Closed: StellaAthena closed this issue 2 years ago

StellaAthena commented 2 years ago

I've been trying to get some of the demos working with OPT, but I keep running into divide-by-zero errors. For example, running accelerate launch examples/pytorch/language-modeling/run_clm_no_trainer.py --model_name_or_path facebook/opt-350m --dataset_name wikitext_tl39 eventually throws:

Traceback (most recent call last):
  File "run_clm_no_trainer.py", line 625, in <module>
    main()
  File "run_clm_no_trainer.py", line 475, in main
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
  File "/home/mchorse/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 440, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/mchorse/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 555, in _prepare_deepspeed
    engine = DeepSpeedEngineWrapper(
  File "/home/mchorse/.local/lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 32, in __init__
    super().__init__(*args, **kwargs)
  File "/home/mchorse/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in __init__
    self._configure_with_arguments(args, mpu)
  File "/home/mchorse/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
    self._config = DeepSpeedConfig(self.config, mpu)
  File "/home/mchorse/.local/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in __init__
    self._configure_train_batch_size()
  File "/home/mchorse/.local/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1050, in _configure_train_batch_size
    self._set_batch_related_parameters()
  File "/home/mchorse/.local/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1025, in _set_batch_related_parameters
    micro_batch //= grad_acc
ZeroDivisionError: integer division or modulo by zero

I see this regardless of which model or dataset I use, though I haven't checked other scripts yet.
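For context, the line that raises is DeepSpeed back-solving the per-GPU micro batch size from the global train batch size and the gradient accumulation steps. A simplified, illustrative sketch of that arithmetic (not DeepSpeed's actual code, and the example numbers are made up):

# Illustrative sketch of the batch-size bookkeeping that fails above; the real
# logic lives in deepspeed/runtime/config.py (_set_batch_related_parameters).
def resolve_micro_batch(train_batch_size, world_size, grad_acc):
    # Back-solve the per-GPU micro batch size from the global batch size.
    micro_batch = train_batch_size // world_size
    micro_batch //= grad_acc  # grad_acc == 0 raises ZeroDivisionError, as in the traceback
    return micro_batch

# Hypothetical numbers: with gradient_accumulation_steps of 0 this raises
# "ZeroDivisionError: integer division or modulo by zero".
resolve_micro_batch(train_batch_size=64, world_size=8, grad_acc=0)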

pacman100 commented 2 years ago

Hello, could you please paste the contents of your accelerate config file or your DeepSpeed plugin details?

StellaAthena commented 2 years ago

How do I pull that up? I don't see anything that's obviously a configuration file.

StellaAthena commented 2 years ago

It prints this out when I run the command:

05/27/2022 19:33:40 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: no
ds_config: {'train_batch_size': None, 'gradient_accumulation_steps': 0, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none'}}, 'steps_per_print': inf, 'zero_allow_untested_optimizer': True, 'fp16': {'enabled': True}}

This seems suspicious, because fp16 is turned on but mixed precision is listed as no.

pacman100 commented 2 years ago

Hello, thanks for the above information. I believe gradient_accumulation_steps being 0 is the issue. Please set it to 1 or above.
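
If you set things up programmatically rather than through a config file, the same knob is exposed on accelerate's DeepSpeedPlugin. A minimal sketch, assuming a recent accelerate release (keyword names may differ slightly between versions):

from accelerate import Accelerator, DeepSpeedPlugin

# gradient_accumulation_steps must be >= 1; DeepSpeed divides the micro batch
# size by it, so 0 triggers the ZeroDivisionError seen above.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)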

Regarding mixed precision, it is displayed as no because DeepSpeed handles it directly and no handling is required by accelerate. Yes, it is confusing, and we will fix the display to make it more intuitive.

StellaAthena commented 2 years ago

What's the best way to edit the configuration?

pacman100 commented 2 years ago

Please run accelerate config on your machine(s) and answer the questions asked. This will generate a config file that is picked up automatically and sets the proper default options when you run accelerate launch my_script.py args.
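
For reference, the generated file usually lands at ~/.cache/huggingface/accelerate/default_config.yaml. A rough sketch of what it might contain for a DeepSpeed ZeRO-3 setup like the one above (field names and defaults vary between accelerate versions, so treat it as illustrative rather than exact):

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1   # must be >= 1 to avoid the divide-by-zero
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: fp16
num_machines: 1
num_processes: 8
use_cpu: false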