hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: How can I run examples/images/diffusion with use_ema? #1966

Open GxjGit opened 1 year ago

GxjGit commented 1 year ago

🐛 Describe the bug

I can successfully run the example with the default settings, following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion.

When I change the value of use_ema from False to True, an error occurs (screenshot attached).

What would be the reason for this problem? Thanks.
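For context, use_ema controls whether the LatentDiffusion module keeps a LitEma shadow copy of the model weights and updates it after every training batch. A rough sketch of that wiring, simplified from ldm/models/diffusion/ddpm.py in this example (the class name below is illustrative, not the exact source):

```python
import torch
from ldm.modules.ema import LitEma  # EMA helper used by the example

class LatentDiffusionSketch(torch.nn.Module):
    """Simplified view of how use_ema is wired up in the example."""

    def __init__(self, unet: torch.nn.Module, use_ema: bool = False):
        super().__init__()
        self.model = unet
        self.use_ema = use_ema
        if self.use_ema:
            # LitEma registers one shadow tensor per trainable parameter.
            self.model_ema = LitEma(self.model)

    def on_train_batch_end(self, *args, **kwargs):
        if self.use_ema:
            # EMA update of the shadow weights; this is the call that
            # fails in the traceback below.
            self.model_ema(self.model)
```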

Log info:

Project config
model:
  base_learning_rate: 0.0001
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: true
    scheduler_config:
      target: ldm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps:
        - 1
        cycle_lengths:
        - 10000000000000
        f_start:
        - 1.0e-06
        f_max:
        - 0.0001
        f_min:
        - 1.0e-10
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        image_size: 32
        from_pretrained: /home//data/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_heads: 8
        use_spatial_transformer: true
        transformer_depth: 1
        context_dim: 768
        use_checkpoint: false
        legacy: false
        use_fp16: true
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        from_pretrained: /home//data/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_fp16: true
    use_fp16: true
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    wrap: false
    train:
      target: ldm.data.base.Txt2ImgIterableBaseDataset
      params:
        file_path: /home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv
        world_size: 1
        rank: 0

Lightning config
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 2
  precision: 16
  auto_select_gpus: false
  strategy:
    target: pytorch_lightning.strategies.ColossalAIStrategy
    params:
      use_chunk: false
      enable_distributed_storage: True,
      placement_policy: cuda
      force_outputs_fp32: false
  log_every_n_steps: 2
  logger: true
  default_root_dir: /tmp/diff_log/
  profiler: pytorch
logger_config:
  wandb:
    target: pytorch_lightning.loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname

samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
samples11 in dataset 2139828
Epoch 0:   0%|          | 0/133740 [00:00<?, ?it/s] samples11 in dataset 2139828
samples11 in dataset 2139828

/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
  warning_cache.warn(
samples11 in dataset 2139828
samples11 in dataset 2139828
[11/16/22 11:35:24] INFO colossalai - colossalai - INFO: /opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137 step
                    INFO colossalai - colossalai - INFO: Found overflow. Skip step
/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0:   0%|          | 1/133740 [00:49<1833:05:15, 49.34s/it, loss=0.175, v_num=0, train/loss_simple_step=0.175, train/loss_vlb_step=0.0018, train/loss_step=0.175, global_step=0.000, lr_abs=1.6e-9]Summoning checkpoint.
[11/16/22 11:35:27] INFO colossalai - ProcessGroup - INFO: /opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24 get
                    INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
FIT Profiler Report
Profile stats for: records
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
    cudaDeviceSynchronize       100.00%      71.000us       100.00%      71.000us      71.000us             1  
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 71.000us

Traceback (most recent call last):
  File "/home//code/ColossalAI/examples/images/diffusion/main.py", line 817, in <module>
    trainer.fit(model, data)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 231, in advance
    self.trainer._call_lightning_module_hook("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home//code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 436, in on_train_batch_end
    self.model_ema(self.model)
  File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//code/ColossalAI/examples/images/diffusion/ldm/modules/ema.py", line 42, in forward
    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
AttributeError: 'NoneType' object has no attribute 'sub_'
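
The failing line is the standard LitEma update, shadow <- shadow - (1 - decay) * (shadow - param). A minimal sketch of that step, under the assumption that the ColossalAI/Gemini strategy can leave some shadow entries as None once it takes over parameter storage (function and variable names here are illustrative, not the exact source):

```python
import torch

@torch.no_grad()
def ema_step(shadow_params, named_model_params, decay: float = 0.9999):
    """One EMA update: shadow -= (1 - decay) * (shadow - param)."""
    one_minus_decay = 1.0 - decay
    for name, param in named_model_params:
        shadow = shadow_params.get(name)
        if shadow is None:
            # This is the state behind the AttributeError above: the
            # shadow tensor was never materialized, or its storage was
            # released once the ZeRO/Gemini chunk manager took ownership
            # of the parameters.
            continue
        shadow.sub_(one_minus_decay * (shadow - param.detach()))
```

If the shadow buffers exist before the strategy shards the model, this update sees real tensors; a None entry, as in the traceback, suggests the EMA copy was never materialized (or was released) under ZeRO/Gemini.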

Environment

(screenshot of environment details attached)

Fazziekey commented 1 year ago

Thanks for your issue; we will fix the bug as soon as we can.

ray0809 commented 1 year ago

Setting cond_stage_trainable = True also raises an error:

/opt/conda/lib/python3.7/site-packages/colossalai/gemini/chunk/manager.py:159 in get_chunk

    156          Args:
    157              tensor (torch.Tensor): a torch tensor object
    158          """
❱   159          return self.tensor_chunk_map[tensor]
    160
    161      def get_cuda_movable_chunks(self) -> List[Chunk]:
    162          """
KeyError: ColoParameter: ColoTensor:
Parameter containing:
Parameter(ColoParameter([[ 4.2009e-04, -3.7899e-03,  3.8624e-03,  ..., -8.2350e-04,
                 1.2369e-03,  5.8413e-04],
               [ 3.8624e-04, -1.3628e-03,  2.3880e-03,  ..., -7.9250e-04,
                 2.1076e-03,  1.0943e-04],
               [ 1.2493e-03,  9.7466e-04,  1.9093e-03,  ...,  1.4000e-03,
                 1.1845e-03, -9.9087e-04],
               ...,
               [-1.3588e-02, -1.8244e-03,  8.0872e-03,  ...,  5.8174e-03,
                -1.0162e-02, -3.7980e-04],
               [-1.0368e-02,  6.7711e-03,  1.0557e-03,  ...,  1.1563e-05,
                -9.3384e-03, -1.8854e-03],
               [-1.7729e-03, -1.2070e-02, -1.2665e-02,  ...,  9.3079e-03,
                 6.6338e-03, -6.0425e-03]], device='cuda:1',
              dtype=torch.float16))
DistSpec:
        placement: DistPlacementPattern.REPLICATE
ProcessGroup:
        Rank: 0, World size: 1, DP degree: 1, TP degree: 1
        Ranks in group: [0]
None
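
The KeyError comes from the Gemini chunk manager: get_chunk can only resolve tensors that were registered with a chunk when the strategy set up the model, so a ColoParameter that becomes trainable afterwards (for example the frozen CLIP cond-stage weights unfrozen via cond_stage_trainable=True) is missing from tensor_chunk_map. A toy sketch of that mismatch (illustrative only, not ColossalAI's actual implementation):

```python
import torch

class ChunkManagerSketch:
    """Toy model of the registration/lookup mismatch behind the KeyError."""

    def __init__(self):
        # Maps each registered parameter tensor to its chunk.
        self.tensor_chunk_map = {}

    def register_tensor(self, tensor: torch.Tensor, chunk) -> None:
        # Called for every parameter while the strategy sets up the model.
        self.tensor_chunk_map[tensor] = chunk

    def get_chunk(self, tensor: torch.Tensor):
        # A parameter that only becomes trainable after setup (e.g. the
        # CLIP cond-stage weights with cond_stage_trainable=True) was
        # never registered, so this lookup raises KeyError.
        return self.tensor_chunk_map[tensor]
```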
GxjGit commented 1 year ago

@Fazziekey Hi, have you fixed this problem?

Fazziekey commented 1 year ago

> @Fazziekey Hi, have you fixed this problem?

Thanks for your issue. We currently do not support cond-stage training; we will support it in the future.

BoyuanJiang commented 1 year ago

> @Fazziekey Hi, have you fixed this problem?
>
> Thanks for your issue. We currently do not support cond-stage training; we will support it in the future.

Is it supported now?

flybird11111 commented 1 year ago

Not yet.