mindspore-lab / mindone

one for all, Optimal generator with No Exception
Apache License 2.0

After SDXL vanilla pretraining, the model outputs pure-noise images at inference #455

Closed Hartmon8 closed 2 months ago

Hartmon8 commented 2 months ago


Hardware Environment

Software Environment

Describe the current behavior

  1. With the SDXL Base and Refiner models, images are generated normally.
  2. With a user-provided ckpt file, images are also generated normally.
  3. Starting from the ckpt in step 2 and running 6000 steps of vanilla pretraining, the generated images were tested and found to be pure noise.

Normal image: 6a5a764f0fbd4ef607ea0396213d3c9

Noise image: aebbc245ce9ca5a863aae36f0efc00d

Describe the expected behavior

Normal images are generated.

Steps to reproduce the issue

Configuration file:

version: SDXL-base-1.0
model:
    target: gm.models.diffusion.DiffusionEngine
    params:
        disable_first_stage_amp: True
        scale_factor: 0.5
        latents_mean:
          - -1.6574
          - 1.886
          - -1.383
          - 2.5155
        latents_std:
          - 8.4927
          - 5.9022
          - 6.5498
          - 5.2299

        denoiser_config:
            target: gm.modules.diffusionmodules.denoiser.Denoiser
            params:
                weighting_config:
                    target: gm.modules.diffusionmodules.denoiser_weighting.EDMWeighting
                    params:
                        sigma_data: 0.5
                scaling_config:
                    target: gm.modules.diffusionmodules.denoiser_scaling.EDMScaling
                    params:
                        sigma_data: 0.5

        network_config:
            target: gm.modules.diffusionmodules.openaimodel.UNetModel
            params:
                adm_in_channels: 2816
                num_classes: sequential
                in_channels: 4
                out_channels: 4
                model_channels: 320
                attention_resolutions: [4, 2]
                num_res_blocks: 2
                channel_mult: [1, 2, 4]
                num_head_channels: 64
                use_spatial_transformer: True
                use_linear_in_transformer: True
                transformer_depth: [1, 2, 10]  # note: the first is unused (due to attn_res starting at 2) 32, 16, 8 --> 64, 32, 16
                context_dim: 2048
                spatial_transformer_attn_type: flash-attention  # vanilla, flash-attention
                legacy: False
                use_recompute: True

        conditioner_config:
            target: gm.modules.GeneralConditioner
            params:
                emb_models:
                  # crossattn cond
                  - is_trainable: False
                    input_key: txt
                    target: gm.modules.embedders.modules.FrozenCLIPEmbedder
                    params:
                      layer: hidden
                      layer_idx: 11
                      version: /data/sdtest/models/models--openai--clip-vit-large-patch14/snapshots/8d052a0f05efbaefbc9e8786ba291cfdf93e5bff
                      # pretrained: ''
                  # crossattn and vector cond
                  - is_trainable: False
                    input_key: txt
                    target: gm.modules.embedders.modules.FrozenOpenCLIPEmbedder2
                    params:
                      arch: ViT-bigG-14-Text
                      freeze: True
                      layer: penultimate
                      always_return_pooled: True
                      legacy: False
                      require_pretrained: False
                      # pretrained: ''  # laion2b_s32b_b79k.ckpt
                  # vector cond
                  - is_trainable: False
                    input_key: original_size_as_tuple
                    target: gm.modules.embedders.modules.ConcatTimestepEmbedderND
                    params:
                      outdim: 256  # multiplied by two
                  # vector cond
                  - is_trainable: False
                    input_key: crop_coords_top_left
                    target: gm.modules.embedders.modules.ConcatTimestepEmbedderND
                    params:
                      outdim: 256  # multiplied by two
                  # vector cond
                  - is_trainable: False
                    input_key: target_size_as_tuple
                    target: gm.modules.embedders.modules.ConcatTimestepEmbedderND
                    params:
                      outdim: 256  # multiplied by two

        first_stage_config:
            target: gm.models.autoencoder.AutoencoderKLInferenceWrapper
            params:
                embed_dim: 4
                monitor: val/rec_loss
                ddconfig:
                    attn_type: vanilla
                    double_z: true
                    z_channels: 4
                    resolution: 256
                    in_channels: 3
                    out_ch: 3
                    ch: 128
                    ch_mult: [1, 2, 4, 4]
                    num_res_blocks: 2
                    attn_resolutions: []
                    dropout: 0.0
                lossconfig:
                    target: mindspore.nn.Identity

        sigma_sampler_config:
            target: gm.modules.diffusionmodules.sigma_sampling.EDMSampling
            params:
                p_mean: 0
                p_std: 0.6

        loss_fn_config:
            target: gm.modules.diffusionmodules.loss.StandardDiffusionLoss

optim:
    base_learning_rate: 1e-6

    optimizer_config:
        target: mindspore.nn.AdamWeightDecay  # mindspore.nn.SGD
        params:
            beta1: 0.9
            beta2: 0.999
            weight_decay: 0.01

    scheduler_config:
        target: gm.lr_scheduler.LambdaWarmUpScheduler
        params:
            warm_up_steps: 50

    # scheduler_config:
    #     target: gm.lr_scheduler.LambdaWarmUpCosineScheduler
    #     params:
    #         warm_up_steps: 62
    #         lr_min: 0.0
    #         lr_max: 1.0
    #         lr_start: 0.0
    #         max_decay_steps: -1

data:
    per_batch_size: 3
    num_epochs: 20
    num_parallel_workers: 32
    python_multiprocessing: True
    shuffle: True

    dataset_config:
        target: gm.data.dataset_wds.T2I_Webdataset
        params:
            caption_key: 'text_english'
            target_size: 1024
            transforms:
                - target: gm.data.mappers.Resize
                  params:
                    size: 1024
                    interpolation: 3
                - target: gm.data.mappers.Rescaler
                  params:
                    isfloat: False
                - target: gm.data.mappers.AddOriginalImageSizeAsTupleAndCropToSquare
                - target: gm.data.mappers.RandomHorizontalFlip
                - target: gm.data.mappers.Transpose
                  params:
                    type: hwc2chw

Related log / screenshot

Special notes for this issue

zhanghuiyao commented 2 months ago

Did your training command line use the --param_fp16 True flag?

townwish4git commented 2 months ago

@Hartmon8 The config file indicates EDM training. Was the pretrained checkpoint you started from also trained with EDM?

Hartmon8 commented 2 months ago

> @Hartmon8 The config file indicates EDM training. Was the pretrained checkpoint you started from also trained with EDM?

No, it wasn't. After turning off the EDM config options, the output is normal again.
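This matches the failure mode: EDM and the standard eps-prediction parameterization precondition the UNet's input and output with different coefficients, so a checkpoint trained under one parameterization produces garbage when fine-tuned or sampled under the other. A minimal sketch of the two scalings for comparison — the formulas are from the EDM paper (Karras et al., 2022) and the common eps-prediction convention, not copied from this repo's code:

```python
import math

def edm_scaling(sigma, sigma_data=0.5):
    # EDM preconditioning: denoised = c_skip * x + c_out * F(c_in * x, c_noise)
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / math.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / math.sqrt(sigma**2 + sigma_data**2)
    return c_skip, c_out, c_in

def eps_scaling(sigma):
    # Standard eps-prediction scaling: denoised = x - sigma * F(x / sqrt(sigma^2 + 1), c_noise)
    c_skip = 1.0
    c_out = -sigma
    c_in = 1.0 / math.sqrt(sigma**2 + 1.0)
    return c_skip, c_out, c_in

# At any nonzero sigma the coefficients diverge sharply, so the two
# parameterizations interpret the same UNet weights very differently.
sigma = 1.0
print("EDM:", edm_scaling(sigma))
print("eps:", eps_scaling(sigma))
```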

Thanks, this issue can be closed.
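For reference, "turning off EDM" amounts to swapping the EDM weighting/scaling/sampling blocks for the discrete eps-prediction ones used by the stock SDXL-base config. A sketch in this config's style — the target names (DiscreteDenoiser, EpsWeighting, EpsScaling, LegacyDDPMDiscretization, DiscreteSampling) follow the upstream sgm SDXL config and should be verified against the sd_xl_base config shipped in this repo:

```yaml
denoiser_config:
    target: gm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
        num_idx: 1000
        weighting_config:
            target: gm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
        scaling_config:
            target: gm.modules.diffusionmodules.denoiser_scaling.EpsScaling
        discretization_config:
            target: gm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization

sigma_sampler_config:
    target: gm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
    params:
        num_idx: 1000
        discretization_config:
            target: gm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization
```

The key point is that the denoiser scaling, the sigma sampler, and the pretrained checkpoint must all agree on one parameterization.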