Hello, could you please paste the contents of your accelerate config file or your DeepSpeed plugin details?
How do I pull that up? I don't see anything that's obviously a configuration file.
It prints out the following when I run the command:
05/27/2022 19:33:40 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 8
Process index: 2
Local process index: 2
Device: cuda:2
Mixed precision type: no
ds_config: {'train_batch_size': None, 'gradient_accumulation_steps': 0, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none'}}, 'steps_per_print': inf, 'zero_allow_untested_optimizer': True, 'fp16': {'enabled': True}}
This seems suspicious, because fp16 is turned on but mixed precision is listed as no.
Hello, thanks for the above information. I believe gradient_accumulation_steps being 0 is the issue. Please set it to 1 or above.
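If you are configuring DeepSpeed through the plugin in code rather than through a config file, a minimal sketch of the fix might look like the following (this assumes the DeepSpeedPlugin keyword arguments available at the time; check your accelerate version's signature):

from accelerate import Accelerator, DeepSpeedPlugin

# Illustrative sketch: mirror the printed ds_config, but with a valid
# gradient_accumulation_steps value (must be >= 1).
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                     # matches 'zero_optimization': {'stage': 3}
    gradient_accumulation_steps=1,    # was 0, which is invalid
    offload_optimizer_device="none",  # matches 'offload_optimizer': {'device': 'none'}
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)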
Regarding mixed precision: it is displayed as no because DeepSpeed handles it by default, so no handling is required by accelerate. Yes, it is confusing, and we will fix the display to make it more intuitive.
What's the best way to edit the configuration?
Please run accelerate config on your machine(s) and answer the questions asked. This will generate a config file that will be used automatically to set the proper default options when you run accelerate launch my_script.py args.
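For reference, the generated file usually lives at ~/.cache/huggingface/accelerate/default_config.yaml. A rough sketch of what it might contain for this setup is shown below; the exact field names vary between accelerate versions, so treat this as illustrative rather than exact:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  zero_stage: 3
fp16: true
machine_rank: 0
num_machines: 1
num_processes: 8
main_training_function: main
use_cpu: false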
I've been trying to get some of the demos working with OPT, but I keep running into divide by zero errors. For example, running
accelerate launch examples/pytorch/language-modeling/run_clm_no_trainer.py --model_name_or_path facebook/opt-350m --dataset_name wikitext_tl39
eventually throws a divide by zero error. I see this regardless of which model I use and regardless of which dataset I use, though I haven't checked other scripts yet.
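A plausible source of the divide by zero, assuming gradient_accumulation_steps is still 0 in the DeepSpeed config: the no-trainer examples derive the scheduler step count by dividing the dataloader length by gradient_accumulation_steps, roughly along these lines (a hypothetical reconstruction, not the actual traceback):

import math

gradient_accumulation_steps = 0  # value picked up from the faulty ds_config
num_batches = 1024               # illustrative dataloader length

# With gradient_accumulation_steps == 0, this division raises
# ZeroDivisionError, regardless of model or dataset.
num_update_steps_per_epoch = math.ceil(num_batches / gradient_accumulation_steps)

Setting the value to 1 or above, as suggested earlier in the thread, makes the division well defined.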