huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

using Deepspeed zero stage3 finetune sd2, dimension error occurs #22705

Closed uygnef closed 1 year ago

uygnef commented 1 year ago

System Info

Describe the bug: an error is raised when fine-tuning the diffusers/examples/text_to_image/train_text_to_image.py script with DeepSpeed ZeRO Stage 3. My machine has 2x A100 GPUs and runs DeepSpeed ZeRO Stage 3. The relevant setup in my training script is:

# Imports used by this excerpt
from accelerate import Accelerator, DeepSpeedPlugin
from accelerate.utils import ProjectConfiguration
from diffusers.utils import deprecate


def train(args):
    if args.non_ema_revision is not None:
        deprecate(
            "non_ema_revision!=None",
            "0.15.0",
            message=(
                "Downloading 'non_ema' weights from revision branches of the Hub is deprecated. Please make sure to"
                " use `--variant=non_ema` instead."
            ),
        )
    # logging_dir = os.path.join(args.output_dir, args.logging_dir)

    accelerator_project_config = ProjectConfiguration(total_limit=args.checkpoints_total_limit)
    # Build the ZeRO Stage 3 plugin and hand it to Accelerate.
    deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=2)
    # deepspeed_plugin.set_mixed_precision("fp16")
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        logging_dir=args.log_dir,
        project_config=accelerator_project_config,
        deepspeed_plugin=deepspeed_plugin,
    )
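
For context on the error below: under ZeRO Stage 3, DeepSpeed partitions every parameter across ranks, and a weight that is not currently gathered shows up locally as an empty placeholder, which matches the tensor([], device='cuda:0', dtype=torch.float16) printed in the log. A minimal diagnostic sketch, assuming the CLIP text encoder from the traceback below; the helper name inspect_token_embedding is hypothetical and not part of the training script:

import deepspeed

def inspect_token_embedding(text_encoder):
    # Hypothetical helper: show that the token embedding is ZeRO-3 partitioned.
    weight = text_encoder.text_model.embeddings.token_embedding.weight
    print("local shape:", weight.shape)  # torch.Size([0]) while partitioned
    # GatheredParameters temporarily all-gathers the weight (no-op if it is not partitioned).
    with deepspeed.zero.GatheredParameters(weight):
        print("gathered shape:", weight.shape)  # (vocab_size, hidden_size)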

The error log is:

04/11/2023 16:59:12 0:INFO: Prepare everything with our accelerator.
[2023-04-11 16:59:12,036] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.5, git-hash=unknown, git-branch=unknown
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Added key: store_based_barrier_key:2 to store for rank: 0
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Added key: store_based_barrier_key:2 to store for rank: 1
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
/usr/local/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  warnings.warn(
04112023 16:59:13|INFO|torch.distributed.distributed_c10d| Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
/usr/local/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  warnings.warn(
[2023-04-11 16:59:13,796] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2023-04-11 16:59:13,796] [INFO] [engine.py:1086:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-04-11 16:59:13,796] [INFO] [engine.py:1092:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-04-11 16:59:13,878] [INFO] [engine.py:1108:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-04-11 16:59:13,878] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-04-11 16:59:13,878] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-04-11 16:59:13,878] [INFO] [engine.py:1410:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-04-11 16:59:13,887] [INFO] [stage3.py:275:__init__] Reduce bucket size 500000000
[2023-04-11 16:59:13,887] [INFO] [stage3.py:276:__init__] Prefetch bucket size 50000000
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5212891101837158 seconds
Loading extension module utils...
Time to load utils op: 0.5023727416992188 seconds
[2023-04-11 16:59:16,286] [INFO] [stage3.py:567:_setup_for_real_optimizer] optimizer state initialized
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005068778991699219 seconds
[2023-04-11 16:59:16,615] [INFO] [utils.py:828:see_memory_usage] After initializing ZeRO optimizer
[2023-04-11 16:59:16,616] [INFO] [utils.py:829:see_memory_usage] MA 7.45 GB         Max_MA 10.52 GB         CA 11.47 GB         Max_CA 11 GB 
[2023-04-11 16:59:16,616] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 5.49 GB, percent = 2.4%
[2023-04-11 16:59:16,616] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-04-11 16:59:16,616] [INFO] [engine.py:795:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-04-11 16:59:16,616] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-11 16:59:16,617] [INFO] [logging.py:69:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[(0.9, 0.999)]
[2023-04-11 16:59:16,618] [INFO] [config.py:1059:print] DeepSpeedEngine configuration:
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   amp_enabled .................. False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   amp_params ................... False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": null, 
    "exps_dir": null, 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   bfloat16_enabled ............. False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   checkpoint_tag_validation_enabled  True
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   checkpoint_tag_validation_fail  False
[2023-04-11 16:59:16,619] [INFO] [config.py:1063:print]   communication_data_type ...... None
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   curriculum_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   curriculum_params ............ False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   dataloader_drop_last ......... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   disable_allgather ............ False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   dump_state ................... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   dynamic_loss_scale_args ...... None
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_gas_boundary_resolution  1
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_layer_num ......... 0
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_max_iter .......... 100
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_stability ......... 1e-06
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_tol ............... 0.01
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   eigenvalue_verbose ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   elasticity_enabled ........... False
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   fp16_enabled ................. True
[2023-04-11 16:59:16,620] [INFO] [config.py:1063:print]   fp16_master_weights_and_gradients  False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   fp16_mixed_quantize .......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   global_rank .................. 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   gradient_accumulation_steps .. 1
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   gradient_clipping ............ 0.0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   gradient_predivide_factor .... 1.0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   initial_dynamic_scale ........ 4294967296
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   loss_scale ................... 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   memory_breakdown ............. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   optimizer_legacy_fusion ...... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   optimizer_name ............... None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   optimizer_params ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   pld_enabled .................. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   pld_params ................... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   prescale_gradients ........... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_change_rate ......... 0.001
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_groups .............. 1
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_offset .............. 1000
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_period .............. 1000
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_rounding ............ 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_start_bits .......... 16
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_target_bits ......... 8
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_training_enabled .... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_type ................ 0
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   quantize_verbose ............. False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   scheduler_name ............... None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   scheduler_params ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   sparse_attention ............. None
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   sparse_gradients_enabled ..... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   steps_per_print .............. inf
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   tensorboard_enabled .......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   tensorboard_job_name ......... DeepSpeedJobName
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   tensorboard_output_path ...... 
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   train_batch_size ............. 16
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   train_micro_batch_size_per_gpu  8
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   use_quantizer_kernel ......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   wall_clock_breakdown ......... False
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   world_size ................... 2
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   zero_allow_untested_optimizer  True
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   zero_config .................. {
    "stage": 3, 
    "contiguous_gradients": true, 
    "reduce_scatter": true, 
    "reduce_bucket_size": 5.000000e+08, 
    "allgather_partitions": true, 
    "allgather_bucket_size": 5.000000e+08, 
    "overlap_comm": true, 
    "load_from_fp32_weights": true, 
    "elastic_checkpoint": false, 
    "offload_param": null, 
    "offload_optimizer": null, 
    "sub_group_size": 1.000000e+09, 
    "prefetch_bucket_size": 5.000000e+07, 
    "param_persistence_threshold": 1.000000e+05, 
    "max_live_parameters": 1.000000e+09, 
    "max_reuse_distance": 1.000000e+09, 
    "gather_16bit_weights_on_model_save": false, 
    "ignore_unused_parameters": true, 
    "round_robin_gradients": false, 
    "legacy_stage1": false
}
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   zero_enabled ................. True
[2023-04-11 16:59:16,621] [INFO] [config.py:1063:print]   zero_optimization_stage ...... 3
[2023-04-11 16:59:16,622] [INFO] [config.py:1065:print]   json = {
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 8, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none"
        }, 
        "offload_param": {
            "device": "none"
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": true, 
        "auto_cast": true
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
Using /home/hadoop-hmart-waimai-rank/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004420280456542969 seconds
04/11/2023 16:59:16 0:INFO: set weight type
04/11/2023 16:59:16 0:INFO: Move text_encode and vae to gpu and cast to weight_dtype
04/11/2023 16:59:16 0:INFO: [starship] accelerate not support all python data type
04/11/2023 16:59:16 0:INFO: ***** Running training *****
04/11/2023 16:59:16 0:INFO:   Num examples = 400
04/11/2023 16:59:16 0:INFO:   Num Epochs = 100
04/11/2023 16:59:16 0:INFO:   Instantaneous batch size per device = 8
04/11/2023 16:59:16 0:INFO:   Total train batch size (w. parallel, distributed & accumulation) = 16
04/11/2023 16:59:16 0:INFO:   Gradient Accumulation steps = 1
04/11/2023 16:59:16 0:INFO:   Total optimization steps = 2500
Steps:   0%|                                                                               | 0/2500 [00:00<?, ?it/s]Parameter containing:
tensor([], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 29, in <module>
    main()
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 21, in main
    run_aigc(args)
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/task.py", line 61, in run_aigc
    train(args)
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/diffuser/train_txt2img.py", line 526, in train
    encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 823, in forward
    return self.text_model(
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 719, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 234, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    return F.embedding(
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
Parameter containing:
tensor([], device='cuda:1', dtype=torch.float16)
Steps:   0%|                                                                               | 0/2500 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 29, in <module>
    main()
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/main.py", line 21, in main
    run_aigc(args)
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/app/task.py", line 61, in run_aigc
    train(args)
  File "/workdir/fengyu05/501587/2924467c592a472aa750166c252e166d/src/diffuser/train_txt2img.py", line 526, in train
    encoder_hidden_states = text_encoder(batch["input_ids"].to(accelerator.device))[0]
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 823, in forward
    return self.text_model(
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 719, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 234, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    return F.embedding(
  File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 32336) of binary: /usr/local/conda/bin/python
Traceback (most recent call last):
  File "/usr/local/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
src/app/main.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-04-11_16:59:28
  host      : workbenchxwmx64350ee0-f9ggd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 32337)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-11_16:59:28
  host      : workbenchxwmx64350ee0-f9ggd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 32336)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

real    0m26.485s
user    0m23.241s
sys     0m22.802s

I read https://github.com/huggingface/diffusers/issues/1865, https://www.deepspeed.ai/tutorials/zero/#allocating-massive-megatron-lm-models, and https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.zero.GatheredParameters, and modified /usr/local/conda/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py as follows:

        self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
        # Gather the partitioned weight on rank 0 so it can be initialized.
        with deepspeed.zero.GatheredParameters(self.token_embedding.weight,
                                               modifier_rank=0):
            # Initialize the token embeddings.
            nn.init.uniform_(self.token_embedding.weight, -1.0, 1)

        # deepspeed.zero.register_external_parameter(self, self.token_embedding.weight)

However, this does not work.
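
For reference, the deepspeed.zero.GatheredParameters context manager linked above temporarily all-gathers partitioned parameters and re-partitions them on exit, so gathering only inside __init__ (as in the patch above) does not keep the weight materialized at forward time. Below is a hedged sketch of gathering around the failing call from the traceback instead; the helper encode_prompts is hypothetical, the names text_encoder and batch follow the traceback, and this is only an illustration of the API, not a confirmed fix:

import deepspeed

def encode_prompts(text_encoder, batch, device):
    # Temporarily all-gather every partitioned weight of the text encoder
    # (including token_embedding) so the forward pass sees full 2-D tensors;
    # DeepSpeed re-partitions them when the context exits.
    with deepspeed.zero.GatheredParameters(list(text_encoder.parameters())):
        return text_encoder(batch["input_ids"].to(device))[0]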

Who can help?

No response

Information

Tasks

Reproduction

I am also experiencing the same issue as mentioned in https://github.com/huggingface/diffusers/issues/1865, therefore I have copied the reproduction steps from the original post.

  1. The accelerate config (accelerate.yaml):

     deepspeed_config:
       deepspeed_config_file: /home/kas/zero_stage3_offload_config.json
       zero3_init_flag: true
     distributed_type: DEEPSPEED
     fsdp_config: {}
     machine_rank: 0
     main_process_ip: null
     main_process_port: null
     main_training_function: main
     mixed_precision: fp16
     num_machines: 1
     num_processes: 4
     use_cpu: false
  2. /home/kas/zero_stage3_offload_config.json (the batch sizes in this config are checked in the sketch after the traceback below):

     {
       "train_micro_batch_size_per_gpu": 16,
       "gradient_accumulation_steps": 2,
       "train_batch_size": 128,
       "steps_per_print": 2,
       "gradient_clipping": 1,
       "zero_optimization": {
         "stage": 3,
         "allgather_partitions": false,
         "allgather_bucket_size": 2e8,
         "overlap_comm": true,
         "reduce_scatter": true,
         "reduce_bucket_size": 2e8,
         "contiguous_gradients": true,
         "offload_optimizer": {
           "device": "cpu",
           "pin_memory": true
         },
         "offload_param": {
           "device": "cpu",
           "pin_memory": true
         },
         "stage3_max_live_parameters": 2e8,
         "stage3_max_reuse_distance": 2e8,
         "stage3_prefetch_bucket_size": 2e8,
         "stage3_param_persistence_threshold": 2e8,
         "sub_group_size": 2e8,
         "round_robin_gradients": true
       },
       "bf16": {
         "enabled": true
       }
     }
  3. Clone the repo, install DeepSpeed, and launch training:

     git clone https://github.com/huggingface/diffusers.git
     cd diffusers/examples/text_to_image
     pip install deepspeed
     export MODEL_NAME="stabilityai/stable-diffusion-2"
     export dataset_name="lambdalabs/pokemon-blip-captions"

     accelerate launch --config_file ./accelerate.yaml --mixed_precision="fp16" train_text_to_image.py \
       --pretrained_model_name_or_path=$MODEL_NAME \
       --dataset_name=$dataset_name \
       --use_ema \
       --resolution=224 --center_crop --random_flip \
       --train_batch_size=16 \
       --gradient_accumulation_steps=2 \
       --gradient_checkpointing \
       --max_train_steps=500 \
       --learning_rate=6e-5 \
       --max_grad_norm=1 \
       --lr_scheduler="constant_with_warmup" --lr_warmup_steps=0 \
       --output_dir="sd-pokemon-model"


  5. Error output:

0%| | 0/500 [00:00<?, ?it/s] Steps: 0%| | 0/500 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_text_to_image.py ", line 718, in <module>
main()
File "train_text_to_image.py ", line 648, in main
encoder_hidden_states = text_encoder(batch["input_ids"])[0]
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 739, in forward
return_dict=return_dict,
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 636, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/transformers/models/clip/modeling_clip.py", line 165, in forward
inputs_embeds = self.token_embedding(input_ids)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/opt/miniconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: 'weight' must be 2-D
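
As a sanity check on the reproduction configs above (purely illustrative arithmetic, not part of the original report): DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of processes, and the step 2 JSON is consistent with num_processes: 4 from step 1:

# Batch-size relation for the DeepSpeed config in step 2 (illustrative only).
micro_batch_per_gpu = 16  # train_micro_batch_size_per_gpu
grad_accum_steps = 2      # gradient_accumulation_steps
num_processes = 4         # num_processes from the accelerate config in step 1
assert micro_batch_per_gpu * grad_accum_steps * num_processes == 128  # train_batch_size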

Expected behavior

The goal is to be able to use ZeRO Stage 3 normally.

uygnef commented 1 year ago

@stas00 Could you please take a look at this issue?

stas00 commented 1 year ago

See https://github.com/huggingface/diffusers/pull/3076

Please carefully read the OP of the PR for details.

luochuwei commented 1 year ago

@uygnef Have you solved this problem?

uygnef commented 1 year ago

@luochuwei Yes, it works for training one model, but there seems to be an issue with training multiple models. I have submitted an issue at https://github.com/microsoft/DeepSpeed/issues/3472.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.