mindspore-lab / mindone

one for all, Optimal generator with No Exception
Apache License 2.0

Abnormal GPU memory usage when training a Stable Diffusion v1 LoRA #465

Open ultranationalism opened 2 months ago

ultranationalism commented 2 months ago

Thanks for sending an issue! Here are some tips for you:

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-ai/mindspore/blob/master/CONTRIBUTING.md

Hardware Environment

Software Environment

Describe the current behavior

e.g. the current output is xxx / the error is xxx

MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 1
Distributed mode: False
Data path: datasets/chinese_art_blip/train
Num params: 1,067,032,491 (unet: 860,318,148, text encoder: 123,060,480, vae: 83,653,863)
Num trainable params: 797,184
Precision: Float16
Use LoRA: True
LoRA rank: 4
Learning rate: 0.0001
Batch size: 1
Weight decay: 0.01
Grad accumulation steps: 1
Num epochs: 200
Loss scaler: dynamic
Init loss scale: 65536.0
Grad clipping: True
Max grad norm: 1.0
EMA: False
Enable flash attention: False

[2024-04-26 00:50:13] INFO: Start training...
[WARNING] PRE_ACT(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.072 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[29491200].
[ERROR] DEVICE(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.578 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:383] MallocForKernelOutput] Allocate output memory failed, node:Default/Cast-op446
Traceback (most recent call last):
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 463, in <module>
    main(args)
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 452, in main
    model.train(
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 617, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 919, in _train_process
    outputs = self._train_network(next_element)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/train/trainer.py", line 95, in construct
    loss = self.network(inputs)  # mini-batch loss
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 402, in construct
    return self.p_losses(x, c, t)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 407, in p_losses
    model_output = self.apply_model(
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 382, in apply_model
    x_recon = self.model(x_noisy, t, cond, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 454, in construct
    out = self.diffusion_model(x, t, context=context, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 711, in construct
    h = cell(h, emb, context)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 197, in construct
    h = self.in_layers_norm(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/util.py", line 115, in construct
    return super().construct(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/layer/normalization.py", line 1188, in construct
    self._check_input_dim(F.shape(x), self.cls_name)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/function/arrayfunc.py", line 1510, in shape
    return shape(input_x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/operations/array_ops.py", line 701, in __call__
    return x.shape
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 85, in shape
    self.stub_shape = self.stub.get_shape()
RuntimeError: Malloc for kernel output failed, Memory isn't enough, node:Default/Cast-op446
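
For scale, the numbers in the log above can be turned into rough memory figures. The sketch below is back-of-the-envelope arithmetic only. It assumes every parameter is held as a 2-byte float16 value (the log says "Precision: Float16", but whether all weights are actually cast is an assumption), and it says nothing about activations, optimizer state, or framework overhead, none of which are measured in this issue.

```python
# Figures copied from the training log above.
UNET_PARAMS = 860_318_148
TEXT_ENCODER_PARAMS = 123_060_480
VAE_PARAMS = 83_653_863
NUM_PARAMS = 1_067_032_491         # "Num params" as printed in the log
NUM_TRAINABLE = 797_184            # "Num trainable params" (LoRA, rank 4)
FAILED_ALLOC_BYTES = 29_491_200    # "required size" in the PRE_ACT warning

BYTES_PER_FP16 = 2                 # assumption: all parameters stored as float16
MIB = 1024 ** 2
GIB = 1024 ** 3

# Sanity check: the per-module counts add up to the reported total.
assert UNET_PARAMS + TEXT_ENCODER_PARAMS + VAE_PARAMS == NUM_PARAMS

weights_gib = NUM_PARAMS * BYTES_PER_FP16 / GIB      # all model weights
lora_mib = NUM_TRAINABLE * BYTES_PER_FP16 / MIB      # trainable LoRA weights only
failed_mib = FAILED_ALLOC_BYTES / MIB                # the allocation that failed

print(f"weights at fp16:   {weights_gib:.2f} GiB")
print(f"LoRA weights:      {lora_mib:.2f} MiB")
print(f"failed allocation: {failed_mib:.3f} MiB")
```

Under these assumptions the weights account for roughly 2 GiB, and the request that failed was only about 28 MiB. Together with "current free memory size[0]" in the warning, this suggests the device memory pool was already exhausted before that request, rather than that one unusually large tensor caused the failure.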

Describe the expected behavior

Please describe the expected outputs or functions you want to have:

When training a Stable Diffusion v1 LoRA with kohya-ss's sd-scripts at image_size=(512,512) and bs=1, GPU memory usage does not exceed 8 GB, so a 12 GB GPU should not run out of memory.

Steps to reproduce the issue

export DEVICE_ID=0
# for non-INFNAN, keep drop overflow update False
export MS_ASCEND_CHECK_OVERFLOW_MODE=1
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" # debugging

task_name=train_lora_sdv1 # rewrite
output_path=outputs
output_dir=$output_path/$task_name

rm -rf $output_dir
mkdir -p $output_dir

python train_text_to_image.py \
    --train_config "configs/train/train_config_lora_v1.yaml" \
    --data_path "datasets/chinese_art_blip/train" \
    --output_path $output_dir \
    --pretrained_model_path "models/AnythingV5.ckpt" \
    --loss_scaler_type "dynamic" \
    --init_loss_scale 65536 \
    --enable_flash_attention=False \
    --drop_overflow_update=True \
    --use_ema=False \
    --lora_rank=4 \
    --epochs=200 \
    --ckpt_save_interval=20 \
    --mode 1 \
    --train_batch_size=1

Related log / screenshot

Special notes for this issue

Songyuanwei commented 2 months ago

I suggest using static graph mode. With mode set to 0, it should be able to run on a 12 GB GPU.
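
As a rough illustration of what switching modes means: the training log prints the numbering as GRAPH(0)/PYNATIVE(1), and MindSpore selects the execution mode through `mindspore.set_context(mode=...)` with the constants `GRAPH_MODE` and `PYNATIVE_MODE`. The helpers `parse_mode` and `apply_mode` below are hypothetical names for this sketch, not functions from train_text_to_image.py, and how that script consumes `--mode` internally is not shown in this thread.

```python
import argparse

# Mode numbering as printed in the training log: GRAPH(0) / PYNATIVE(1).
MODE_NAMES = {0: "GRAPH", 1: "PYNATIVE"}


def parse_mode(argv):
    """Read a --mode flag (hypothetical helper); other flags are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", type=int, choices=sorted(MODE_NAMES), default=0)
    args, _unknown = parser.parse_known_args(argv)
    return args.mode


def apply_mode(mode):
    """Select the MindSpore execution mode when MindSpore is installed.

    Returns the mode name either way, so the sketch also runs without MindSpore.
    """
    try:
        import mindspore as ms
    except ImportError:
        return MODE_NAMES[mode]
    ms.set_context(mode=ms.GRAPH_MODE if mode == 0 else ms.PYNATIVE_MODE)
    return MODE_NAMES[mode]


if __name__ == "__main__":
    # Static graph mode, as suggested above.
    print(apply_mode(parse_mode(["--mode", "0"])))
```

In the reproduction command, this corresponds to passing `--mode 0` instead of `--mode 1`.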

ultranationalism commented 2 months ago

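"I suggest using static graph mode. With mode set to 0, it should be able to run on a 12 GB GPU."

I tested this. In static graph mode, MindSpore used up all 56 GB of memory I had reserved for the container, which crashed my Docker container outright.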