Training failed, how do I fix it?

chengyinglie commented 8 months ago

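Environment: aliyun PAI-DSW, image modelscope:1.10.0-pytorch2.1.0tensorflow2.14.0-gpu-py310. All setup steps completed, with only a few incompatibility errors along the way. After I uploaded the images and clicked Train, the models downloaded without any problem, but then ERROR was displayed. The backend log is as follows: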

Traceback (most recent call last):
  File "/mnt/workspace/facechain/facechain/train_text_to_image_lora.py", line 1224, in <module>
    main()
  File "/mnt/workspace/facechain/facechain/train_text_to_image_lora.py", line 1036, in main
    accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1989, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
Steps:   0%|          | 0/200 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 994, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 636, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '/mnt/workspace/facechain/facechain/train_text_to_image_lora.py', '--pretrained_model_name_or_path=ly261666/cv_portrait_model', '--revision=v2.0', '--sub_path=film/film', '--output_dataset_name=/mnt/workspace/facechain/worker_data/qw/training_data/ly261666/cv_portrait_model/person1', '--caption_column=text', '--resolution=512', '--random_flip', '--train_batch_size=1', '--num_train_epochs=200', '--checkpointing_steps=5000', '--learning_rate=1.5e-04', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--seed=42', '--output_dir=/mnt/workspace/facechain/worker_data/qw/ly261666/cv_portrait_model/person1', '--lora_r=4', '--lora_alpha=32', '--lora_text_encoder_r=32', '--lora_text_encoder_alpha=32', '--resume_from_checkpoint=fromfacecommon']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/gradio/queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.10/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.10/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "/mnt/workspace/facechain/app.py", line 804, in run
    train_lora_fn(base_model_path=base_model_path,
  File "/mnt/workspace/facechain/app.py", line 207, in train_lora_fn
    raise gr.Error("训练失败 (Training failed)")
gradio.exceptions.Error: '训练失败 (Training failed)'

Could someone please take a look?

yanxinyixy commented 8 months ago

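It looks like gradients are not enabled on the VAE latents during LoRA training.

Adding latents.requires_grad_(True) right after the latents are obtained, around line 1100, should fix it.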
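
The RuntimeError in the log means that the loss passed to accelerator.backward is not connected to any tensor that requires gradients. Before patching the training loop, it may help to check whether the model exposes any trainable parameters at all. The following is a minimal sketch, not code from facechain: the helper name trainable_parameter_names is hypothetical, and it assumes only that each parameter object has a boolean requires_grad attribute, as the pairs yielded by PyTorch's Module.named_parameters() do.

```python
from types import SimpleNamespace


def trainable_parameter_names(named_parameters):
    """Return the names of parameters whose requires_grad flag is True.

    `named_parameters` is any iterable of (name, parameter) pairs, for
    example the result of `module.named_parameters()` in PyTorch.
    """
    return [name for name, param in named_parameters if param.requires_grad]


# Stand-in objects so the sketch runs without PyTorch installed (assumption:
# real parameters expose the same `requires_grad` attribute). The parameter
# names are made up for illustration.
fake_params = [
    ("attn1.to_q.weight", SimpleNamespace(requires_grad=False)),
    ("attn1.to_q.lora_up.weight", SimpleNamespace(requires_grad=True)),
]
print(trainable_parameter_names(fake_params))
```

In the training script this could be called on the UNet's named_parameters() just before the training loop (the variable name depends on the script). An empty result would suggest that the LoRA layers were never attached or were left frozen, in which case enabling gradients on the latents alone may not be enough.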

chengyinglie commented 8 months ago

Which file is that in, train_text_to_image_lora.py or app.py? I couldn't find it. Thanks.

bigmarten commented 8 months ago

Training failed: how do I install the mmcv package on Linux?

Process Process-1:
Traceback (most recent call last):
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/utils/registry.py", line 210, in build_from_cfg
    return obj_cls._instantiate(**args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/models/base/base_model.py", line 67, in _instantiate
    return cls(**kwargs)
           ^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/models/cv/face_detection/scrfd/damofd_detect.py", line 31, in __init__
    super().__init__(model_dir, **kwargs)
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/models/cv/face_detection/scrfd/scrfd_detect.py", line 33, in __init__
    from mmcv import Config
ModuleNotFoundError: No module named 'mmcv'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/utils/registry.py", line 212, in build_from_cfg
    return obj_cls(**args)
           ^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/pipelines/cv/face_detection_pipeline.py", line 36, in __init__
    super().__init__(model=model, **kwargs)
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/pipelines/base.py", line 100, in __init__
    self.model = self.initiate_single_model(model, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/pipelines/base.py", line 53, in initiate_single_model
    return Model.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/models/base/base_model.py", line 183, in from_pretrained
    model = build_model(model_cfg, task_name=task_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/models/builder.py", line 35, in build_model
    model = build_from_cfg(
            ^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ModuleNotFoundError: DamoFdDetect: No module named 'mmcv'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/saizong/stable-diffusion/extensions/facechain/facechain/inference.py", line 25, in _data_process_fn_process
    Blipv2()(input_img_dir)
    ^^^^^^^^
  File "/home/saizong/stable-diffusion/extensions/facechain/facechain/data_process/preprocessing.py", line 207, in __init__
    self.face_detection = pipeline(task=Tasks.face_detection, model='damo/cv_ddsar_face-detection_iclr23-damofd', model_revision='v1.1')
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/pipelines/builder.py", line 170, in pipeline
    return build_pipeline(cfg, task_name=task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/pipelines/builder.py", line 65, in build_pipeline
    return build_from_cfg(
           ^^^^^^^^^^^^^^^
  File "/home/saizong/.local/lib/python3.11/site-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
ModuleNotFoundError: FaceDetectionPipeline: DamoFdDetect: No module named 'mmcv'
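
The root cause in this chain is the first exception: modelscope's face detection model runs "from mmcv import Config", and mmcv is not installed in this Python 3.11 environment. Because the failing statement imports Config from mmcv itself, an mmcv 1.x release is probably what is needed here, since that class was moved to mmengine in mmcv 2.x. A minimal sketch for checking which modules are missing before launching training, using only the standard library; the helper name missing_modules and the module list are illustrative assumptions, not part of facechain:

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Illustrative list: only `mmcv` and `modelscope` are named in the traceback.
required = ["mmcv", "modelscope"]
print("missing:", missing_modules(required))
```

Running this with the same interpreter that launches facechain (here the Python 3.11 under /home/saizong/.local) matters, because a package installed for a different interpreter will still show up as missing.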

ultimatech-cn commented 8 months ago

I resolved the error by adding "loss.requires_grad = True" after line 1033 in train_text_to_image_lora.py:

                train_loss += avg_loss.item() / args.gradient_accumulation_steps
                loss.requires_grad = True
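
Setting requires_grad on the loss lets backward() run, but if the loss is still not connected to the LoRA weights, no gradient reaches them and the optimizer step changes nothing. A minimal sketch for checking this after one backward() call follows. The helper name parameters_without_grad is hypothetical, and it assumes only that parameters expose requires_grad and grad attributes, as PyTorch parameters do.

```python
from types import SimpleNamespace


def parameters_without_grad(named_parameters):
    """Return names of trainable parameters whose .grad is still None.

    Intended to be called after loss.backward(); in PyTorch, .grad
    typically stays None for a parameter that took no part in the loss.
    """
    return [
        name
        for name, param in named_parameters
        if param.requires_grad and param.grad is None
    ]


# Stand-in objects so the sketch runs without PyTorch (assumption: real
# parameters expose the same attributes). Names and values are made up.
fake_params = [
    ("lora_down.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("lora_up.weight", SimpleNamespace(requires_grad=True, grad=0.5)),
    ("frozen.weight", SimpleNamespace(requires_grad=False, grad=None)),
]
print(parameters_without_grad(fake_params))
```

If every trainable parameter is reported here after the first step, the workaround has only silenced the error and the saved LoRA weights are unlikely to contain anything learned.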

cweihua commented 8 months ago

Quoting ultimatech-cn:

I resolved the error by adding "loss.requires_grad = True" after line 1033 in train_text_to_image_lora.py:

                train_loss += avg_loss.item() / args.gradient_accumulation_steps
                loss.requires_grad = True

I added this, but a new problem occurred:

100%|████████████████████████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 2.09it/s]
Traceback (most recent call last):
  File "/root/facechain-main/facechain/train_text_to_image_lora.py", line 1225, in <module>
    main()
  File "/root/facechain-main/facechain/train_text_to_image_lora.py", line 1211, in main
    pipeline.unet.load_attn_procs(args.output_dir)
  File "/root/miniconda3/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/diffusers/loaders/unet.py", line 297, in load_attn_procs
    raise ValueError(f"Module {key} is not a LoRACompatibleConv or LoRACompatibleLinear module.")
ValueError: Module down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q is not a LoRACompatibleConv or LoRACompatibleLinear module.
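
One plausible explanation for this ValueError, not confirmed anywhere in this thread, is that the installed diffusers release differs from the one the training script was written for, so the UNet's attention layers are not of the type load_attn_procs expects. Including the installed versions in a report makes that easier to check. A minimal sketch using only the standard library; the helper name installed_versions is hypothetical and the package list is an assumption based on the paths in the tracebacks above:

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list, taken from the site-packages paths in the logs.
packages = ["diffusers", "accelerate", "torch", "modelscope", "huggingface_hub"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

The output can then be compared against the versions pinned in the repository's requirements file, if it provides one.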

chengyinglie commented 8 months ago

OK, thanks a lot!

clearlove88 commented 8 months ago

Quoting cweihua's comment above:

I added this, but a new problem occurred: [...] ValueError: Module down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q is not a LoRACompatibleConv or LoRACompatibleLinear module.

I ran into this problem too.

sunbaigui commented 5 months ago

Please try the newest version, facechain-fact, which is train-free and runs inference in about 10 seconds.

modelscope / facechain

Training failed, how do I fix it? #527