kohya-ss / sd-scripts


[SDXL] save_file bug on DeepSpeed #629

Open · alfredplpl opened 1 year ago

alfredplpl commented 1 year ago

Hi,

I ran into a bug in `save_file` when training with DeepSpeed. How should I fix it?

override steps. steps for 1 epochs is / 指定エポックまでのステップ数: 18165
[2023-07-09 14:35:23,072] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.5, git-hash=unknown, git-branch=unknown
[2023-07-09 14:35:25,118] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-07-09 14:35:25,120] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-07-09 14:35:25,120] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-07-09 14:35:25,294] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-07-09 14:35:25,294] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-07-09 14:35:25,294] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2023-07-09 14:35:25,294] [INFO] [stage_1_and_2.py:133:__init__] Reduce bucket size 500,000,000
[2023-07-09 14:35:25,294] [INFO] [stage_1_and_2.py:134:__init__] Allgather bucket size 500,000,000
[2023-07-09 14:35:25,294] [INFO] [stage_1_and_2.py:135:__init__] CPU Offload: False
[2023-07-09 14:35:25,294] [INFO] [stage_1_and_2.py:136:__init__] Round robin gradient partitioning: False
Rank: 0 partition count [2] and sizes[(1283731842, False)] 
Rank: 1 partition count [2] and sizes[(1283731842, False)] 
[2023-07-09 14:35:28,817] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-07-09 14:35:28,817] [INFO] [utils.py:786:see_memory_usage] MA 9.73 GB         Max_MA 12.12 GB         CA 12.12 GB         Max_CA 12 GB 
[2023-07-09 14:35:28,817] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 34.1%
[2023-07-09 14:35:29,565] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-07-09 14:35:29,565] [INFO] [utils.py:786:see_memory_usage] MA 19.29 GB         Max_MA 33.64 GB         CA 36.04 GB         Max_CA 36 GB 
[2023-07-09 14:35:29,565] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 34.1%
[2023-07-09 14:35:29,565] [INFO] [stage_1_and_2.py:488:__init__] optimizer state initialized
[2023-07-09 14:35:30,307] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-07-09 14:35:30,308] [INFO] [utils.py:786:see_memory_usage] MA 19.29 GB         Max_MA 19.29 GB         CA 36.04 GB         Max_CA 36 GB 
[2023-07-09 14:35:30,308] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 42.89 GB, percent = 34.1%
[2023-07-09 14:35:30,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-07-09 14:35:30,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-07-09 14:35:30,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-07-09 14:35:30,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[(0.9, 0.999)]
[2023-07-09 14:35:30,320] [INFO] [config.py:960:print] DeepSpeedEngine configuration:
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   amp_enabled .................. False
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   amp_params ................... False
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   bfloat16_enabled ............. True
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   checkpoint_parallel_write_pipeline  False
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   checkpoint_tag_validation_enabled  True
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   checkpoint_tag_validation_fail  False
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa07b98c820>
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   communication_data_type ...... None
[2023-07-09 14:35:30,320] [INFO] [config.py:964:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   curriculum_enabled_legacy .... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   curriculum_params_legacy ..... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   data_efficiency_enabled ...... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   dataloader_drop_last ......... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   disable_allgather ............ False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   dump_state ................... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   dynamic_loss_scale_args ...... None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_enabled ........... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_gas_boundary_resolution  1
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_layer_num ......... 0
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_max_iter .......... 100
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_stability ......... 1e-06
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_tol ............... 0.01
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   eigenvalue_verbose ........... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   elasticity_enabled ........... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   fp16_auto_cast ............... None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   fp16_enabled ................. False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   fp16_master_weights_and_gradients  False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   global_rank .................. 0
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   grad_accum_dtype ............. None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   gradient_accumulation_steps .. 16
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   gradient_clipping ............ 1.0
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   gradient_predivide_factor .... 1.0
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   initial_dynamic_scale ........ 1
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   load_universal_checkpoint .... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   loss_scale ................... 1.0
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   memory_breakdown ............. False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   mics_hierarchial_params_gather  False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   mics_shard_size .............. -1
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   optimizer_legacy_fusion ...... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   optimizer_name ............... None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   optimizer_params ............. None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   pld_enabled .................. False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   pld_params ................... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   prescale_gradients ........... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   scheduler_name ............... None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   scheduler_params ............. None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   sparse_attention ............. None
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   sparse_gradients_enabled ..... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   steps_per_print .............. inf
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   train_batch_size ............. 32
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   train_micro_batch_size_per_gpu  1
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   use_node_local_storage ....... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   wall_clock_breakdown ......... False
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   world_size ................... 2
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   zero_allow_untested_optimizer  True
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   zero_enabled ................. True
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   zero_force_ds_cpu_optimizer .. True
[2023-07-09 14:35:30,321] [INFO] [config.py:964:print]   zero_optimization_stage ...... 2
[2023-07-09 14:35:30,321] [INFO] [config.py:950:print_user_config]   json = {
    "train_batch_size": 32, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 16, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "none"
        }, 
        "offload_param": {
            "device": "none"
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "bf16": {
        "enabled": true
    }, 
    "fp16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
running training / 学習開始
  num examples / サンプル数: 4650083
  num batches per epoch / 1epochのバッチ数: 290639
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 1
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 32
  gradient accumulation steps / 勾配を合計するステップ数 = 16
  total optimization steps / 学習ステップ数: 18165
steps:   0%|                                          | 0/18165 [00:00<?, ?it/s]
epoch 1/1
/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/xformers/ops/fmha/flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/xformers/ops/fmha/flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
steps:   1%|                 | 100/18165 [07:45<23:20:16,  4.65s/it, loss=0.112]
saving checkpoint: /media/ozakiy/NVM/kari/kari_xl-step00000100.safetensors
Traceback (most recent call last):
  File "/mnt/my_raid/github/sd-scripts/sdxl_train.py", line 618, in <module>
    train(args)
  File "/mnt/my_raid/github/sd-scripts/sdxl_train.py", line 477, in train
    sdxl_train_util.save_sd_model_on_epoch_end_or_stepwise(
  File "/mnt/my_raid/github/sd-scripts/library/sdxl_train_util.py", line 306, in save_sd_model_on_epoch_end_or_stepwise
    train_util.save_sd_model_on_epoch_end_or_stepwise_common(
  File "/mnt/my_raid/github/sd-scripts/library/train_util.py", line 3561, in save_sd_model_on_epoch_end_or_stepwise_common
    sd_saver(ckpt_file, epoch_no, global_step)
  File "/mnt/my_raid/github/sd-scripts/library/sdxl_train_util.py", line 290, in sd_saver
    sdxl_model_util.save_stable_diffusion_checkpoint(
  File "/mnt/my_raid/github/sd-scripts/library/sdxl_model_util.py", line 312, in save_stable_diffusion_checkpoint
    save_file(state_dict, output_file)
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/safetensors/torch.py", line 394, in _flatten
    raise RuntimeError(
RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

steps:   1%|                 | 100/18165 [07:46<23:24:42,  4.67s/it, loss=0.112]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 22397 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22396) of binary: /mnt/my_raid/github/sd-scripts/venv/bin/python
Traceback (most recent call last):
  File "/mnt/my_raid/github/sd-scripts/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 903, in launch_command
    deepspeed_launcher(args)
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 648, in deepspeed_launcher
    distrib_run.run(args)
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/my_raid/github/sd-scripts/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-09_14:43:24
  host      : balthasar
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22396)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Best

alfredplpl commented 1 year ago

I found this related issue in safetensors: https://github.com/huggingface/safetensors/issues/202, in particular this comment: https://github.com/huggingface/safetensors/issues/202#issuecomment-1604354959
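
The workaround discussed there is either to save with `safetensors.torch.save_model` (which de-duplicates shared tensors, as the RuntimeError message itself suggests) or to break the sharing yourself before calling `save_file`. A minimal sketch of the second option, assuming a plain PyTorch state dict (the helper name is mine, not from sd-scripts):

```python
import torch
from safetensors.torch import save_file

def save_file_dedup(state_dict: dict, path: str) -> None:
    # Clone every tensor so that no two entries share the same underlying
    # storage, and make each one contiguous, which save_file also requires.
    # This costs extra peak memory but yields a valid .safetensors file.
    deduped = {k: v.detach().clone().contiguous() for k, v in state_dict.items()}
    save_file(deduped, path)
```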

alfredplpl commented 1 year ago

--save_model_as=ckpt works. However, --save_model_as=safetensors does not.
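
Presumably .ckpt works because torch.save (pickle) stores a shared storage once and re-links it on load, while safetensors refuses to write duplicated memory at all. To see which state-dict keys are tied together, a quick diagnostic along these lines (my own helper, assuming PyTorch 2.x for `untyped_storage`) should work:

```python
from collections import defaultdict
import torch

def find_shared_tensors(state_dict: dict) -> list:
    # Group state-dict keys by the data pointer of their underlying
    # storage; any group with more than one key is exactly what
    # safetensors' save_file rejects.
    groups = defaultdict(list)
    for name, tensor in state_dict.items():
        groups[tensor.untyped_storage().data_ptr()].append(name)
    return [names for names in groups.values() if len(names) > 1]
```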

DarkAlchy commented 1 year ago

Is this still an issue? I haven't gotten that far yet, since I have no JSON config and am trying to wing it, and I get "'DeepSpeedEngine' object has no attribute 'text_model'".
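
That error is probably separate from the save_file bug: DeepSpeed wraps the trained model in a DeepSpeedEngine, so attributes of the original nn.Module (such as text_model on the CLIP text encoder) have to be reached through .module. A minimal sketch, not specific to sd-scripts:

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # DeepSpeedEngine (like DistributedDataParallel) exposes the wrapped
    # model as .module; walk down until the bare nn.Module is reached.
    while hasattr(model, "module"):
        model = model.module
    return model

# e.g. text_model = unwrap_model(text_encoder).text_model
```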