huggingface / diffusers

πŸ€— Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Dreambooth] ValueError: Unet loaded as datatype torch.float16 (should be torch.float32) #2084

Closed ArgeusDominguez closed 1 year ago

ArgeusDominguez commented 1 year ago

Describe the bug

Greetings,

I am following the dog toy example on an 8 GB GPU:

(https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README.md#training-on-a-8-gb-gpu)

I use accelerate config to set up DeepSpeed according to the example (see config below) and then launch train_dreambooth.py. Everything seems to work fine until I hit the ValueError shown in the logs below. I get this error with both the CompVis/stable-diffusion-v1-4 and runwayml/stable-diffusion-v1-5 models from the Hugging Face Hub.

My understanding was that the model weights load as fp32 unless a different revision of the model is specified (fp16, bf16, etc.), so I don't understand why it keeps failing the low-precision guard.
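
For reference, loading the UNet on its own (outside of Accelerate/DeepSpeed) gives the dtype I would expect. This is only a minimal sketch, with the model id and subfolder taken from the example:

import torch
from diffusers import UNet2DConditionModel

# Load only the UNet, without revision="fp16" or torch_dtype=torch.float16,
# and inspect its dtype.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
print(unet.dtype)                   # expected: torch.float32
assert unet.dtype == torch.float32  # this is the condition the training script enforces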

Reproduction

Launch script:

#!/bin/bash

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="data/sks"
export CLASS_DIR="data/dog"
export OUTPUT_DIR="data/model"

accelerate launch --mixed_precision="fp16" train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

Logs

/home/{USER}/2diffusers/2diff/lib/python3.10/site-packages/accelerate/accelerator.py:227: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of πŸ€— Accelerate. Use `project_dir` instead.
  warnings.warn(
[2023-01-23 17:04:42,580] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
01/23/2023 17:04:42 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 1, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}, 'bf16': {'enabled': False}}

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 543/543 [00:00<00:00, 805kB/s]
Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 342/342 [00:00<00:00, 542kB/s]
Fetching 16 files:   6%|β–‹         | 1/16 [00:00<00:14,  1.04it/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4.56k/4.56k [00:00<00:00, 6.15MB/s]
Fetching 16 files:  19%|β–ˆβ–‰        | 3/16 [00:01<00:07,  1.64it/s]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.22G/1.22G [00:15<00:00, 78.3MB/s]
Fetching 16 files:  25%|β–ˆβ–ˆβ–Œ       | 4/16 [00:18<01:12,  6.06s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 209/209 [00:00<00:00, 279kB/s]
Fetching 16 files:  31%|β–ˆβ–ˆβ–ˆβ–      | 5/16 [00:19<00:48,  4.41s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 313/313 [00:00<00:00, 523kB/s]
Fetching 16 files:  38%|β–ˆβ–ˆβ–ˆβ–Š      | 6/16 [00:20<00:32,  3.26s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 592/592 [00:00<00:00, 1.01MB/s]
Fetching 16 files:  44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 7/16 [00:21<00:22,  2.53s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 492M/492M [00:06<00:00, 79.3MB/s]
Fetching 16 files:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 8/16 [00:28<00:31,  3.93s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 525k/525k [00:01<00:00, 517kB/s]
Fetching 16 files:  56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 9/16 [00:30<00:23,  3.33s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 472/472 [00:00<00:00, 801kB/s]
Fetching 16 files:  62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 10/16 [00:31<00:15,  2.61s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 806/806 [00:00<00:00, 1.19MB/s]
Fetching 16 files:  69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 11/16 [00:32<00:10,  2.11s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.06M/1.06M [00:01<00:00, 889kB/s] 
Fetching 16 files:  75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  | 12/16 [00:34<00:08,  2.15s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 743/743 [00:00<00:00, 1.24MB/s]
Fetching 16 files:  81%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 13/16 [00:35<00:05,  1.79s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.44G/3.44G [00:43<00:00, 79.0MB/s]
Fetching 16 files:  88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 14/16 [01:19<00:29, 14.64s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 522/522 [00:00<00:00, 820kB/s]
Fetching 16 files:  94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 15/16 [01:20<00:10, 10.45s/it]
Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 335M/335M [00:04<00:00, 67.9MB/s]
Fetching 16 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 16/16 [01:26<00:00,  5.41s/it]

{'requires_safety_checker'} was not found in config. Values will be initialized to default values.
{'prediction_type'} was not found in config. Values will be initialized to default values.
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'mid_block_type', 'only_cross_attention', 'resnet_time_scale_shift', 'class_embed_type', 'dual_cross_attention', 'upcast_attention', 'num_class_embeds'} was not found in config. Values will be initialized to default values.
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
01/23/2023 17:06:12 - INFO - __main__ - Number of class images to sample: 200.

Generating class images: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 200/200 [19:56<00:00,  5.98s/it]

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 806/806 [00:00<00:00, 1.23MB/s]

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.06M/1.06M [00:01<00:00, 727kB/s] 

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 525k/525k [00:00<00:00, 577kB/s]

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 472/472 [00:00<00:00, 644kB/s]

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 592/592 [00:00<00:00, 791kB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'prediction_type'} was not found in config. Values will be initialized to default values.

Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 492M/492M [00:06<00:00, 72.3MB/s]
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'mid_block_type', 'only_cross_attention', 'resnet_time_scale_shift', 'class_embed_type', 'dual_cross_attention', 'upcast_attention', 'num_class_embeds'} was not found in config. Values will be initialized to default values.
[2023-01-23 17:26:31,030] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1+867da307, git-hash=867da307, git-branch=master
01/23/2023 17:26:31 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:2 to store for rank: 0
01/23/2023 17:26:31 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-01-23 17:26:32,278] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-01-23 17:26:32,279] [INFO] [logging.py:75:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-01-23 17:26:32,279] [INFO] [logging.py:75:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-01-23 17:26:32,326] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-01-23 17:26:32,326] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-01-23 17:26:32,327] [INFO] [logging.py:75:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-01-23 17:26:32,327] [INFO] [stage_1_and_2.py:141:__init__] Reduce bucket size 500,000,000
[2023-01-23 17:26:32,327] [INFO] [stage_1_and_2.py:142:__init__] Allgather bucket size 500,000,000
[2023-01-23 17:26:32,327] [INFO] [stage_1_and_2.py:143:__init__] CPU Offload: True
[2023-01-23 17:26:32,327] [INFO] [stage_1_and_2.py:144:__init__] Round robin gradient partitioning: False
Using /home/{USER}/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/{USER}/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.06965351104736328 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)] 
[2023-01-23 17:26:34,692] [INFO] [utils.py:831:see_memory_usage] Before initializing optimizer states
[2023-01-23 17:26:34,692] [INFO] [utils.py:832:see_memory_usage] MA 1.66 GB         Max_MA 3.07 GB         CA 1.66 GB         Max_CA 3 GB 
[2023-01-23 17:26:34,692] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 9.68 GB, percent = 36.6%
[2023-01-23 17:27:25,867] [INFO] [utils.py:831:see_memory_usage] After initializing optimizer states
[2023-01-23 17:27:25,877] [INFO] [utils.py:832:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 1.66 GB         Max_CA 2 GB 
[2023-01-23 17:27:25,878] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 16.3 GB, percent = 61.7%
[2023-01-23 17:27:25,878] [INFO] [stage_1_and_2.py:522:__init__] optimizer state initialized
[2023-01-23 17:27:26,006] [INFO] [utils.py:831:see_memory_usage] After initializing ZeRO optimizer
[2023-01-23 17:27:26,007] [INFO] [utils.py:832:see_memory_usage] MA 1.66 GB         Max_MA 1.66 GB         CA 1.66 GB         Max_CA 2 GB 
[2023-01-23 17:27:26,008] [INFO] [utils.py:840:see_memory_usage] CPU Virtual Memory:  used = 16.3 GB, percent = 61.7%
[2023-01-23 17:27:26,131] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-01-23 17:27:26,132] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-01-23 17:27:26,132] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-01-23 17:27:26,132] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.999)]
[2023-01-23 17:27:26,136] [INFO] [config.py:1008:print] DeepSpeedEngine configuration:
[2023-01-23 17:27:26,137] [INFO] [config.py:1012:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-01-23 17:27:26,137] [INFO] [config.py:1012:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-01-23 17:27:26,137] [INFO] [config.py:1012:print]   amp_enabled .................. False
[2023-01-23 17:27:26,138] [INFO] [config.py:1012:print]   amp_params ................... False
[2023-01-23 17:27:26,139] [INFO] [config.py:1012:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   bfloat16_enabled ............. False
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   checkpoint_parallel_write_pipeline  False
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   checkpoint_tag_validation_enabled  True
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   checkpoint_tag_validation_fail  False
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0b2fbcbbb0>
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   communication_data_type ...... None
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-01-23 17:27:26,140] [INFO] [config.py:1012:print]   curriculum_enabled_legacy .... False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   curriculum_params_legacy ..... False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   data_efficiency_enabled ...... False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   dataloader_drop_last ......... False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   disable_allgather ............ False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   dump_state ................... False
[2023-01-23 17:27:26,141] [INFO] [config.py:1012:print]   dynamic_loss_scale_args ...... None
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_enabled ........... False
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_gas_boundary_resolution  1
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_layer_num ......... 0
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_max_iter .......... 100
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_stability ......... 1e-06
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_tol ............... 0.01
[2023-01-23 17:27:26,142] [INFO] [config.py:1012:print]   eigenvalue_verbose ........... False
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   elasticity_enabled ........... False
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   fp16_auto_cast ............... True
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   fp16_enabled ................. True
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   fp16_master_weights_and_gradients  False
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   global_rank .................. 0
[2023-01-23 17:27:26,143] [INFO] [config.py:1012:print]   grad_accum_dtype ............. None
[2023-01-23 17:27:26,144] [INFO] [config.py:1012:print]   gradient_accumulation_steps .. 1
[2023-01-23 17:27:26,144] [INFO] [config.py:1012:print]   gradient_clipping ............ 0.0
[2023-01-23 17:27:26,144] [INFO] [config.py:1012:print]   gradient_predivide_factor .... 1.0
[2023-01-23 17:27:26,144] [INFO] [config.py:1012:print]   initial_dynamic_scale ........ 65536
[2023-01-23 17:27:26,144] [INFO] [config.py:1012:print]   load_universal_checkpoint .... False
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   loss_scale ................... 0
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   memory_breakdown ............. False
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f0b2fbcb970>
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   optimizer_legacy_fusion ...... False
[2023-01-23 17:27:26,145] [INFO] [config.py:1012:print]   optimizer_name ............... None
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   optimizer_params ............. None
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   pld_enabled .................. False
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   pld_params ................... False
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   prescale_gradients ........... False
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   scheduler_name ............... None
[2023-01-23 17:27:26,146] [INFO] [config.py:1012:print]   scheduler_params ............. None
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   sparse_attention ............. None
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   sparse_gradients_enabled ..... False
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   steps_per_print .............. inf
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   train_batch_size ............. 1
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   train_micro_batch_size_per_gpu  1
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   use_node_local_storage ....... False
[2023-01-23 17:27:26,147] [INFO] [config.py:1012:print]   wall_clock_breakdown ......... False
[2023-01-23 17:27:26,148] [INFO] [config.py:1012:print]   world_size ................... 1
[2023-01-23 17:27:26,148] [INFO] [config.py:1012:print]   zero_allow_untested_optimizer  True
[2023-01-23 17:27:26,150] [INFO] [config.py:1012:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-01-23 17:27:26,150] [INFO] [config.py:1012:print]   zero_enabled ................. True
[2023-01-23 17:27:26,150] [INFO] [config.py:1012:print]   zero_optimization_stage ...... 2
[2023-01-23 17:27:26,151] [INFO] [config.py:997:print_user_config]   json = {
    "train_batch_size": 1, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu"
        }, 
        "offload_param": {
            "device": "cpu"
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": true, 
        "auto_cast": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
Using /home/{USER}/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.010467290878295898 seconds
Traceback (most recent call last):
  File "/home/{USER}/2diffusers/diffusers/examples/dreambooth/train_dreambooth.py", line 879, in <module>
    main(args)
  File "/home/{USER}/2diffusers/diffusers/examples/dreambooth/train_dreambooth.py", line 720, in main
    raise ValueError(f"Unet loaded as datatype {unet.dtype}. {low_precision_error_string}")
ValueError: Unet loaded as datatype torch.float16. Please make sure to always have all model weights in full float32 precision when starting training - even if doing mixed precision training. copy of the weights should still be float32.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3942) of binary: /home/{USER}/2diffusers/2diff/bin/python3
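
For context, the check in train_dreambooth.py that raises this error (line 720 in the traceback) is essentially a dtype guard on the UNet. The following is a paraphrased sketch, not the exact source, and the helper name is mine:

import torch

def check_full_precision(model, name="Unet"):
    # Hypothetical helper paraphrasing the guard in train_dreambooth.py:
    # training refuses to start if the model weights were loaded in fp16/bf16,
    # even when mixed-precision training is enabled.
    low_precision_error_string = (
        "Please make sure to always have all model weights in full float32 precision "
        "when starting training - even if doing mixed precision training."
    )
    if model.dtype != torch.float32:
        raise ValueError(f"{name} loaded as datatype {model.dtype}. {low_precision_error_string}")

So something in the DeepSpeed/fp16 setup appears to cast the UNet to float16 before this check runs.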

System Info

- 'diffusers' version: 0.12.0.dev0
- Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
- Python version: 3.10.9
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 0.15.0.dev0
- Accelerate version: not installed
- xFormers version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes

I don't know why Accelerate shows up as not installed, but when I run pip show accelerate I get:

Name: accelerate
Version: 0.15.0.dev0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: sylvain@huggingface.co
License: Apache
Location: /home/{USER}/2diffusers/2diff/lib/python3.10/site-packages
Requires: numpy, packaging, psutil, pyyaml, torch
Required-by:

default_config.yaml for Accelerate:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false
patil-suraj commented 1 year ago

The fix is here: #2102

williamberman commented 1 year ago

Just merged the fix @ArgeusDominguez. If you pull from main, it should work now :)

ArgeusDominguez commented 1 year ago

Many thanks guys