mindspore-lab / mindone

one for all, Optimal generator with No Exception
https://mindspore-lab.github.io/mindone/
Apache License 2.0

Fine-tuning with examples/diffusers/text_to_image/train_text_to_image_lora.py fails #629

Open ChjxL opened 1 month ago

ChjxL commented 1 month ago

Hardware Environment

Software Environment

Describe the current behavior

2024-08-14 17:10:00,056 - modelscope - INFO - PyTorch version 2.1.0 Found.
2024-08-14 17:10:00,061 - modelscope - INFO - Loading ast index from /home/ma-user/.cache/modelscope/ast_indexer
2024-08-14 17:10:00,123 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 54d31e3d3abbdd999283f7b24d7db88f and a total number of 980 components indexed
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch_npu/utils/path_manager.py:79: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.0.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current user.
  warnings.warn(f"Warning: The {path} owner does not match the current user.")
08/14/2024 17:10:57 - INFO - __main__ - UNet2DConditionModel ==> Trainable params: 797,184 || All params: 860,318,148 || Trainable ratio: 0.09266153%
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 297.08it/s]
You have disabled the safety checker for <class 'mindone.diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
08/14/2024 17:11:01 - INFO - __main__ - Running training
08/14/2024 17:11:01 - INFO - __main__ - Num examples = 856
08/14/2024 17:11:01 - INFO - __main__ - Num Epochs = 100
08/14/2024 17:11:01 - INFO - __main__ - Instantaneous batch size per device = 1
08/14/2024 17:11:01 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
08/14/2024 17:11:01 - INFO - __main__ - Gradient Accumulation steps = 1
08/14/2024 17:11:01 - INFO - __main__ - Total optimization steps = 85600
08/14/2024 17:11:01 - INFO - __main__ - Running validation... Generating 4 images with prompt: a man in a straw hat.
08/14/2024 17:15:17 - INFO - __main__ - Validation done.
Steps: 0%| | 0/85600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ma-user/work/mindone/examples/diffusers/text_to_image/train_text_to_image_lora.py", line 955, in <module>
    main()
  File "/home/ma-user/work/mindone/examples/diffusers/text_to_image/train_text_to_image_lora.py", line 790, in main
    loss, model_pred = train_step(batch)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 703, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1071, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1054, in compile
    _cell_graph_executor.compile(self, self._compile_args, phase=self.phase,
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/mindspore/common/api.py", line 1819, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
TypeError: getattr(): attribute name must be string but got: External
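For readers checking the log, the reported "Total optimization steps = 85600" matches the usual step arithmetic of diffusers-style training scripts. The following is a minimal sketch under that assumption, not code taken from train_text_to_image_lora.py; the function name `total_optimization_steps` and the `num_devices` parameter are hypothetical and used only for illustration.

```python
import math


def total_optimization_steps(num_examples, batch_size, grad_accum_steps,
                             num_epochs, num_devices=1):
    """Estimate optimizer updates for a run (hypothetical helper).

    Assumes each device sees batch_size examples per step and that one
    optimizer update happens every grad_accum_steps batches.
    """
    # Batches the dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    # Optimizer updates per epoch after gradient accumulation.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return updates_per_epoch * num_epochs


# Values from the log: 856 examples, batch size 1, accumulation 1, 100 epochs.
print(total_optimization_steps(856, 1, 1, 100))  # 85600
```

The numbers agree with the log, so the run was configured as intended and the failure occurs at the very first training step (0/85600), during graph compilation.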



Describe the expected behavior

1. Training completes successfully.

Steps to reproduce the issue

Run: python train_text_to_image_lora.py --pretrained_model_name_or_path=/home/ma-user/work/stable-diffusion-v1-4/ --dataset_name=/home/ma-user/work/onepiece-blip-captions/ --resolution=512 --center_crop --random_flip --train_batch_size=1 --num_train_epochs=100 --checkpointing_steps=5000 --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 --mixed_precision="fp16" --seed=42 --validation_prompt="a man in a straw hat" --output_dir="sd-onepiece-model-lora-$(date +%Y%m%d%H%M%S)"

vigo999 commented 3 weeks ago

@The-truthh please check

ChjxL commented 3 weeks ago

Additional information on the graphics card type:

image

The full error message is as follows:

Command executed: python train_text_to_image_lora.py --pretrained_model_name_or_path=/home/ma-user/work/maj/ --dataset_name=/home/ma-user/work/wgdz/wgdz/ --resolution=768 --center_crop --random_flip --train_batch_size=1 --max_train_steps=15000 --learning_rate=1e-05 --max_grad_norm=1 --mixed_precision="fp16" --lr_scheduler="constant" --lr_warmup_steps=0 --output_dir="sd-onepiece-model-$(date +%Y%m%d%H%M%S)"

image image

ChjxL commented 3 weeks ago

Different MindSpore versions produce different errors.

This setup uses CANN 8. With MindSpore 2.2.0, the following error occurs, which differs from the earlier one: image image image

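MindSpore 2.2.0 is not our main concern. We only ran into a different problem while trying other MindSpore versions, so we are reporting it to the maintainers as well.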
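Since the error depends on the MindSpore version, it may help to attach the exact installed versions as text rather than only as screenshots. The following is a minimal sketch using only the Python standard library; the helper name `collect_versions` is hypothetical, and the distribution names passed to it ("mindspore", "mindone", "torch", "torch_npu") are assumptions that may differ from the names under which the packages are actually installed. The CANN toolkit is not a Python package, so it is not covered here.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return a dict of package name -> installed version (hypothetical helper).

    Packages that cannot be found are reported as "not installed"
    instead of raising, so the report is always complete.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; adjust to the environment.
    report = collect_versions(["mindspore", "mindone", "torch", "torch_npu"])
    for name, version in report.items():
        print(f"{name}: {version}")
```

Pasting the printed lines under "Software Environment" makes it easier to compare the failing setups across MindSpore versions.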

vigo999 commented 1 week ago

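Hello, we have received this issue. A fix for SD XL LoRA training with the diffusers-compatible components is in progress, and we will update the README as soon as possible. @The-truthh @liuchuting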

ChjxL commented 1 week ago

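> Hello, we have received this issue. A fix for SD XL LoRA training with the diffusers-compatible components is in progress, and we will update the README as soon as possible. @The-truthh @liuchuting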

OK, thank you for your reply. Looking forward to the update!