hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: examples/images/diffusion ran failed #1951

Closed. GxjGit closed this issue 1 year ago.

GxjGit commented 1 year ago

🐛 Describe the bug

I ran the diffusion example following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion.

Steps:

```sh
conda env create -f environment.yaml
conda activate ldm
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa
pip install -r requirements.txt && pip install .
```

dataset: laion-400m

run: bash train.sh

Failure output:

```
/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Traceback (most recent call last):
  File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in <module>
    trainer.fit(model, data)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
    closure_result = closure()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
    return self.model(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
    outputs = self.module(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
    loss, loss_dict = self.shared_step(batch)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
    loss = self(x, c)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
    return self.p_losses(x, c, t, *args, **kwargs)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
    model_output = self.apply_model(x_noisy, t, cond)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
    h = th.cat([h, hs.pop()], dim=1)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
    ret = func(*args, **kwargs)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
```
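The failure at the bottom is in the U-Net's skip connection, `h = th.cat([h, hs.pop()], dim=1)`. A minimal standalone repro of this class of error, with illustrative shapes rather than the model's actual ones:

```python
import torch

# torch.cat only lets sizes differ along the concatenation dim (dim=1 here);
# every other dim must match, otherwise it raises this exact RuntimeError.
h = torch.zeros(1, 320, 8, 8)      # e.g. an upsampled decoder feature map
skip = torch.zeros(1, 320, 7, 7)   # e.g. a stored encoder feature map
torch.cat([h, skip], dim=1)
# RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 8 but got size 7 for tensor number 1 in the list.
```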

Environment

[screenshot of the environment]

binmakeswell commented 1 year ago

Hi @GxjGit, thank you for your feedback. We will try to reproduce your issue and fix it soon.

Fazziekey commented 1 year ago

Can you upload your train.yaml?

GxjGit commented 1 year ago

I have not modified the YAML settings.
In addition, since I could not find the pretrained model, I commented out the code that loads the pretrained weights for UNetModel and AutoencoderKL. I am not sure whether that is related.

[screenshots of the commented-out pretrained-model loading code]

```yaml
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: false   # Note: different from the one we trained before
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: False
    scheduler_config: # 10000 warmup steps
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
        f_start: [ 1.e-6 ]
        f_max: [ 1.e-4 ]
        f_min: [ 1.e-10 ]
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32 # unused
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin'
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions: [ 4, 2, 1 ]
        num_res_blocks: 2
        channel_mult: [ 1, 2, 4, 4 ]
        num_heads: 8
        use_spatial_transformer: True
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: False
        legacy: False
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin'
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_fp16: True
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 64
    wrap: False
    train:
      target: ldm.data.base.Txt2ImgIterableBaseDataset
      params:
        file_path: "/home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv"
        world_size: 1
        rank: 0
lightning:
  trainer:
    accelerator: 'gpu' 
    devices: 4
    log_gpu_memory: all
    max_epochs: 2
    precision: 16
    auto_select_gpus: False
    strategy:
      target: pytorch_lightning.strategies.ColossalAIStrategy
      params:
        use_chunk: False
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: False
    log_every_n_steps: 2
    logger: True
    default_root_dir: "/tmp/diff_log/"
    profiler: pytorch
  logger_config:
    wandb:
      target: pytorch_lightning.loggers.WandbLogger
      params:
          name: nowname
          save_dir: "/tmp/diff_log/"
          offline: opt.debug
          id: nowname
```
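For readers unfamiliar with this config format: every `target`/`params` pair is resolved by the example's `instantiate_from_config` helper in `ldm.util`, which imports the dotted path and calls it with the params as keyword arguments. Roughly, a paraphrased sketch:

```python
import importlib

def get_obj_from_str(string: str):
    # "ldm.models.diffusion.ddpm.LatentDiffusion" -> the LatentDiffusion class
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config: dict):
    # Instantiate config["target"] with config["params"] as keyword arguments.
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
```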
Fazziekey commented 1 year ago

Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

liuchenbaidu commented 1 year ago

`conda env create -f environment.yaml` gives ResolvePackageNotFound:

• cudatoolkit=11.3
• libgcc-ng[version='>=9.3.0']
• __glibc[version='>=2.17']
• cudatoolkit=11.3
• libstdcxx-ng[version='>=9.3.0']

Can it run on CPU?

GxjGit commented 1 year ago

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

ok, I am downloading it and trying it again.

GxjGit commented 1 year ago

> `conda env create -f environment.yaml` gives ResolvePackageNotFound:
>
> • cudatoolkit=11.3
> • libgcc-ng[version='>=9.3.0']
> • __glibc[version='>=2.17']
> • cudatoolkit=11.3
> • libstdcxx-ng[version='>=9.3.0']
>
> Can it run on CPU?

I ran it in a GPU environment.
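For context on the resolution failure quoted above: `__glibc`, `libgcc-ng`, and `libstdcxx-ng` are Linux-only packages, so this error usually means the environment file is being solved on a non-Linux host or without a channel that carries the pinned `cudatoolkit`. A hedged sketch of one thing to try, assuming a Linux x86_64 machine and that conda-forge carries the `cudatoolkit=11.3` build:

```sh
# Assumption: Linux x86_64 host. Adding conda-forge can help when only
# cudatoolkit=11.3 fails to resolve; it will not help on macOS/Windows,
# where the Linux-only packages above can never be found.
conda config --add channels conda-forge
conda env create -f environment.yaml
```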

GxjGit commented 1 year ago

> Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.

@Fazziekey I have updated the pretrained model and the code, but I encountered the same problem.

How should I interpret this error:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.

What does "size 8" refer to?

flynnamy commented 1 year ago

Hi, could you please share the exact link to the pretrained model? I only found some *.ckpt models. @GxjGit

GxjGit commented 1 year ago

> Hi, could you please share the exact link to the pretrained model? I only found some *.ckpt models. @GxjGit

Look here: click the "Files and versions" tab. [screenshot]

You can also download it from the command line: [screenshot]
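The screenshot of the command itself is not preserved. For reference, one standard way to fetch the whole model repo from the command line is via Git LFS, since the weights are stored as LFS objects:

```sh
# Requires git-lfs; a plain git clone would only fetch LFS pointer files.
git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
```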

GxjGit commented 1 year ago

> > Maybe you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4.
>
> @Fazziekey I have updated the pretrained model and the code, but I encountered the same problem.
>
> How should I interpret this error:
>
> RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
>
> What does "size 8" refer to?

@Fazziekey I have solved my problem. The cause was an incorrect image size. Because the images in my dataset have different sizes, it reported "RuntimeError: stack expects each tensor to be equal size, but got [140, 140, 3] at entry 0 and [300, 500, 3] at entry 1", so I resized them to 224 x 224, which produced the error above. When I changed the size to 256 x 256, matching the YAML setting, it ran successfully.

However, I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256*256? [screenshot]

I suggest adding a description of the resolution requirements for input dataset images. Anyway, thanks a lot.
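The size arithmetic also explains the earlier "expected size 8 but got size 7" message. Assuming the standard factors for this config (the VAE downsamples by 8, and the U-Net with `channel_mult: [ 1, 2, 4, 4 ]` halves the latent three more times), the input side length must be divisible by 8 * 2**3 = 64; 256 is, 224 is not. A small sketch of the arithmetic:

```python
# Sketch: why 256x256 works but 224x224 breaks the U-Net skip connections.
# Assumed factors: VAE downsamples 8x; the U-Net then halves 3 times.
def unet_sizes(side: int) -> list[int]:
    latent = side // 8                      # VAE: 256 -> 32, 224 -> 28
    sizes = [latent]
    for _ in range(3):
        sizes.append((sizes[-1] + 1) // 2)  # stride-2 conv rounds odd sizes up
    return sizes

print(unet_sizes(256))  # [32, 16, 8, 4]: each upsample doubles back exactly
print(unet_sizes(224))  # [28, 14, 7, 4]: 4 upsamples to 8, but the stored
                        # skip is 7, so th.cat([h, hs.pop()], dim=1) fails
```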

flynnamy commented 1 year ago

Thanks! @GxjGit I downloaded it and ran into some problems: 1. Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModelZero; 2. like the following: [screenshot]

Fazziekey commented 1 year ago

> However, I cannot find a resize operation in the original code. Are the images in your dataset all a fixed size of 256*256?

Yes, the input image size must be 256*256 for Latent Diffusion.
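If a dataset contains mixed sizes, a typical dataset-side fix is to force every image to 256x256 before batching. A minimal sketch using torchvision (the name `to_256` is illustrative, not part of the example repo):

```python
from torchvision import transforms

# Resize the shorter side to 256, then crop to exactly 256x256, so that
# torch.stack in the collate step and the U-Net skip connections both
# see consistent shapes.
to_256 = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
])
```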