Hardware Environment | 硬件环境
Which hardware backend reproduces the error?
[ ] Ascend
[X] GPU: 3080 12G
[ ] CPU
Software Environment | 软件环境
MindSpore version: 2.2.3
Python version: 3.9.5
OS: Ubuntu 18.04.6 LTS (WSL2, Docker Desktop)
GCC/Compiler version: 9
Describe the current behavior | 目前输出
Training aborts with a GPU out-of-memory error within seconds of starting. The configuration summary, log, and traceback follow.
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 1
Distributed mode: False
Data path: datasets/chinese_art_blip/train
Num params: 1,067,032,491 (unet: 860,318,148, text encoder: 123,060,480, vae: 83,653,863)
Num trainable params: 797,184
Precision: Float16
Use LoRA: True
LoRA rank: 4
Learning rate: 0.0001
Batch size: 1
Weight decay: 0.01
Grad accumulation steps: 1
Num epochs: 200
Loss scaler: dynamic
Init loss scale: 65536.0
Grad clipping: True
Max grad norm: 1.0
EMA: False
Enable flash attention: False
[2024-04-26 00:50:13] INFO: Start training...
[WARNING] PRE_ACT(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.072 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[29491200].
[ERROR] DEVICE(70996,7f8ce37fe700,python):2024-04-26-00:50:19.865.578 [mindspore/ccsrc/runtime/pynative/run_op_helper.cc:383] MallocForKernelOutput] Allocate output memory failed, node:Default/Cast-op446
Traceback (most recent call last):
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 463, in <module>
    main(args)
  File "/root/mindone/examples/stable_diffusion_v2/train_text_to_image.py", line 452, in main
    model.train(
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
    self._train(epoch,
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 617, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/train/model.py", line 919, in _train_process
    outputs = self._train_network(*next_element)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/train/trainer.py", line 95, in construct
    loss = self.network(*inputs)  # mini-batch loss
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 402, in construct
    return self.p_losses(x, c, t)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 407, in p_losses
    model_output = self.apply_model(
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 382, in apply_model
    x_recon = self.model(x_noisy, t, cond, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/models/diffusion/ddpm.py", line 454, in construct
    out = self.diffusion_model(x, t, context=context, **kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 711, in construct
    h = cell(h, emb, context)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/openaimodel.py", line 197, in construct
    h = self.in_layers_norm(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/root/mindone/examples/stable_diffusion_v2/ldm/modules/diffusionmodules/util.py", line 115, in construct
    return super().construct(x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/nn/layer/normalization.py", line 1188, in construct
    self._check_input_dim(F.shape(x), self.cls_name)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/function/arrayfunc.py", line 1510, in shape
    return shape(input_x)
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/ops/operations/array_ops.py", line 701, in __call__
    return x.shape
  File "/root/miniconda3/envs/mindspore_py39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 85, in shape
    self.stub_shape = self.stub.get_shape()
RuntimeError: Malloc for kernel output failed, Memory isn't enough, node:Default/Cast-op446
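For anyone triaging the log, the allocator warning above can be decoded with a few lines of Python. This is only a sketch: the regular expression and the helper name `parse_alloc_warning` are assumptions made for illustration and are not part of MindSpore.

```python
import re

# Allocator warning copied from the log above (timestamp prefix trimmed).
WARNING = (
    "Memory not enough: current free memory size[0] "
    "is smaller than required size[29491200]."
)


def parse_alloc_warning(line):
    """Hypothetical helper: return (free_bytes, required_bytes) from the message."""
    match = re.search(
        r"free memory size\[(\d+)\].*required size\[(\d+)\]", line
    )
    if match is None:
        raise ValueError("not an allocator warning")
    free_bytes, required_bytes = (int(group) for group in match.groups())
    return free_bytes, required_bytes


free_bytes, required_bytes = parse_alloc_warning(WARNING)
required_mib = required_bytes / 2**20  # bytes to MiB
print(f"free: {free_bytes} B, required: {required_bytes} B ({required_mib:.3f} MiB)")
```

The failing request is only about 28 MiB while the reported free size is 0, which suggests the pool was already exhausted by earlier allocations rather than by one oversized tensor.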
Describe the expected behavior | 期望输出
Training a Stable Diffusion v1 LoRA with kohya-ss's sd-scripts at image_size=(512, 512) and bs=1 uses no more than 8 GB of GPU memory, so the same job should not run out of memory on a 12 GB card.
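As a rough cross-check of that expectation, the weight footprint can be estimated from the parameter counts printed in the log. The sketch below covers weights and optimizer state only; the assumption of an Adam-style optimizer with two FP32 states per trainable parameter is mine and is not confirmed by the log.

```python
# Parameter counts copied from the training log in this report.
PARAMS = {
    "unet": 860_318_148,
    "text_encoder": 123_060_480,
    "vae": 83_653_863,
}
TRAINABLE = 797_184  # LoRA parameters at rank 4, from the log

BYTES_FP16 = 2
BYTES_FP32 = 4
GIB = 2**30
MIB = 2**20

total = sum(PARAMS.values())
weights_fp16_gib = total * BYTES_FP16 / GIB
weights_fp32_gib = total * BYTES_FP32 / GIB

# Assumption: Adam-style optimizer, two FP32 states per trainable parameter.
optimizer_mib = TRAINABLE * 2 * BYTES_FP32 / MIB

print(f"total params:        {total:,}")
print(f"weights in FP16:     {weights_fp16_gib:.2f} GiB")
print(f"weights in FP32:     {weights_fp32_gib:.2f} GiB")
print(f"LoRA optimizer state: {optimizer_mib:.2f} MiB")
```

Under these assumptions the weights and optimizer state take a small share of 12 GB, so most of the usage would have to come from activations, workspace buffers, and allocator overhead, none of which this estimate covers.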
Steps to reproduce the issue | 复现报错的步骤
export DEVICE_ID=0

# for non-INFNAN, keep drop overflow update False
export MS_ASCEND_CHECK_OVERFLOW_MODE=1
export MS_ASCEND_CHECK_OVERFLOW_MODE="INFNAN_MODE" # debugging

task_name=train_lora_sdv1 # rewrite
output_path=outputs
output_dir=$output_path/$task_name

rm -rf $output_dir
mkdir -p $output_dir

python train_text_to_image.py \
    --train_config "configs/train/train_config_lora_v1.yaml" \
    --data_path "datasets/chinese_art_blip/train" \
    --output_path $output_dir \
    --pretrained_model_path "models/AnythingV5.ckpt" \
    --loss_scaler_type "dynamic" \
    --init_loss_scale 65536 \
    --enable_flash_attention=False \
    --drop_overflow_update=True \
    --use_ema=False \
    --lora_rank=4 \
    --epochs=200 \
    --ckpt_save_interval=20 \
    --mode 1 \
    --train_batch_size=1
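Since the log reports zero free memory at the first step, it may help to record how much GPU memory is already in use before launching the script above (under WSL2, other processes on the host may hold part of it). A minimal sketch, assuming `nvidia-smi` is on PATH inside the container; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical.

```python
import shutil
import subprocess

# Standard nvidia-smi query flags: used and total memory in MiB, no header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU; skip unparsable lines."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            used, total = (int(field.strip()) for field in line.split(","))
        except ValueError:
            continue  # e.g. a field reported as [N/A]
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} MiB used of {total} MiB")
```

Running this immediately before `python train_text_to_image.py` and attaching the output would show whether the full 12 GB was available to the training process.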
Related log / screenshot | 完整日志
Special notes for this issue | 其他信息