microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] KeyError in stage_1_and_2.py when training dreambooth with deepspeed (in kohya_ss) #3718

Open me-fraud opened 1 year ago

me-fraud commented 1 year ago

Hello!

I've encountered an issue trying to run dreambooth training with deepspeed in kohya_ss.

I am running into an error that seems to occur inside DeepSpeed's stage_1_and_2.py at lines 508-509:

```python
lp_name = self.param_names[lp]
param_mapping_per_group[lp_name] = lp._hp_mapping.get_hp_fragment_address()
```

Additionally, I tried wrapping these lines in a try/except to see what would happen, but then ran into failures further along in engine.py (though I'm not sure whether those are related).
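
For what it's worth, here is a minimal sketch (my own illustration, not DeepSpeed's actual code) of the failure mode I suspect: judging from the traceback, `param_names` is a dict keyed by the parameter objects themselves, so if any parameter tensor gets replaced by a new object after that mapping is built, the lookup in `_create_param_mapping` has nothing to find:

```python
# Minimal sketch of the suspected failure mode (illustration only, not
# DeepSpeed source). A {parameter_object: name} dict is built once; any
# parameter that is later *replaced* by a new object is no longer a key.
import torch

model = torch.nn.Linear(4, 4)

# Keyed by parameter object identity (torch.Tensor hashes by id):
param_names = {p: n for n, p in model.named_parameters()}

# Simulate a parameter being swapped out after the mapping was built,
# e.g. by precision casting or a wrapper re-registering parameters:
model.weight = torch.nn.Parameter(model.weight.data.clone())

for p in model.parameters():
    # Same shape and values, but a different object => KeyError for the
    # weight, which is exactly what stage_1_and_2.py line 539 runs into.
    name = param_names[p]
```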

My configuration:

- 1x RTX 3060 GPU (12 GB VRAM)
- WSL2 Ubuntu 22.04 on Windows 11
- CUDA 11.7
- Python 3.10.6
- Torch 2.0.1+cu117
- Accelerate 0.19.0
- DeepSpeed 0.8.3 (the problem is the same with 0.9.3)
- training precision set to fp16

DeepSpeed configuration JSON:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```
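
As an aside, the config can be syntax-checked quickly before launching; a malformed file (e.g. a missing comma) fails here immediately. The path below is a placeholder for wherever kohya_ss writes the config:

```python
# Sanity-check that the DeepSpeed config parses as valid JSON.
# "ds_config.json" is a placeholder path; substitute your own file.
import json

with open("ds_config.json") as f:
    cfg = json.load(f)  # raises json.JSONDecodeError on a syntax error

print(cfg["zero_optimization"]["stage"])  # expect: 2
```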

console output:

```
Traceback (most recent call last):
  /home/me/kohya_ss/train_db.py:482 in <module>
    train(args)
  /home/me/kohya_ss/train_db.py:202 in train
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        unet, text_encoder, optimizer, train_dataloader, lr_scheduler
    )
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1139 in prepare
    result = self._prepare_deepspeed(*args)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:1446 in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/__init__.py:125 in initialize
    engine = DeepSpeedEngine(args=args, model=model, optimizer=optimizer,
                             model_parameters=model_parameters, ...)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:340 in __init__
    self._configure_optimizer(optimizer, model_parameters)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1298 in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1547 in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(optimizer, self.param_names, timers=timers, ...)
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:527 in __init__
    self._param_slice_mappings = self._create_param_mapping()
  /home/me/kohya_ss/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:539 in _create_param_mapping
    lp_name = self.param_names[lp]

KeyError: Parameter containing:
tensor([[[[-2.5410e-02,  2.5043e-02,  7.1978e-02],
          [-1.3399e-02, -1.3034e-01,  1.1476e-01],
          [-9.7030e-03, -1.3150e-02,  2.8044e-02]],
         ...,
         [[-2.3223e-02, -4.5166e-02, -8.9723e-03],
          [ 2.2507e-02,  1.0017e-03,  2.8759e-02],
          [ 3.7623e-02,  6.9246e-03,  2.1055e-02]]]], device='cuda:0',
       requires_grad=True)
```

[02:21:28] ERROR failed (exitcode: 1) local_rank: 0 (pid: 129535) of binary: /home/me/kohya_ss/venv/bin/python3
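
In case it helps anyone debugging this, here is a rough pre-flight check I put together (my own sketch, not a DeepSpeed or Accelerate API). It verifies, before `deepspeed.initialize()` / `accelerator.prepare()` runs, that every parameter object the optimizer holds is still a key of the model's `named_parameters()` mapping, since that mismatch is what the KeyError above points to:

```python
# Hedged diagnostic sketch: report optimizer parameters that the model no
# longer knows by name. `model` and `optimizer` stand in for whatever you
# pass to deepspeed.initialize() / accelerator.prepare().
def find_unnamed_params(model, optimizer):
    named = {p: n for n, p in model.named_parameters()}
    missing = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            if p not in named:
                missing.append(tuple(p.shape))
    return missing  # non-empty => the same mismatch that raises the KeyError

# Example usage before training starts:
#   print(find_unnamed_params(unet, optimizer))
```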

congchan commented 1 year ago

Same issue here with a CodeGen model.

zedong-mt commented 1 year ago

Has anybody solved this problem?

memray commented 1 year ago

Ran into the same issue.

Ting011 commented 1 year ago

Same issue. Appreciate any hint.

mumianyuxin commented 1 year ago

Same issue. Has anybody solved this problem?

whcjb commented 6 days ago

Same issue.

tjruwase commented 4 days ago

@whcjb, can you please share full repro details? Thanks!