Closed GxjGit closed 1 year ago
Hi @GxjGit Thank you for your feedback. We will try to reproduce your issue and fix it soon.
can you upload your train.yaml
I have not modified the yaml setting.
In addition, As I can't find the training model, I have annotated the code of pretrained model loading of UNetModel and AutoencoderKL. I have no idea if it is concerned.
model:
base_learning_rate: 1.0e-04
target: ldm.models.diffusion.ddpm.LatentDiffusion
params:
linear_start: 0.00085
linear_end: 0.0120
num_timesteps_cond: 1
log_every_t: 200
timesteps: 1000
first_stage_key: image
cond_stage_key: caption
image_size: 64
channels: 4
cond_stage_trainable: false # Note: different from the one we trained before
conditioning_key: crossattn
monitor: val/loss_simple_ema
scale_factor: 0.18215
use_ema: False
scheduler_config: # 10000 warmup steps
target: ldm.lr_scheduler.LambdaLinearScheduler
params:
warm_up_steps: [ 1 ] # NOTE for resuming. use 10000 if starting from scratch
cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
f_start: [ 1.e-6 ]
f_max: [ 1.e-4 ]
f_min: [ 1.e-10 ]
unet_config:
target: ldm.modules.diffusionmodules.openaimodel.UNetModel
params:
image_size: 32 # unused
from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/unet/diffusion_pytorch_model.bin'
in_channels: 4
out_channels: 4
model_channels: 320
attention_resolutions: [ 4, 2, 1 ]
num_res_blocks: 2
channel_mult: [ 1, 2, 4, 4 ]
num_heads: 8
use_spatial_transformer: True
transformer_depth: 1
context_dim: 768
use_checkpoint: False
legacy: False
first_stage_config:
target: ldm.models.autoencoder.AutoencoderKL
params:
embed_dim: 4
from_pretrained: '/data/scratch/diffuser/stable-diffusion-v1-4/vae/diffusion_pytorch_model.bin'
monitor: val/rec_loss
ddconfig:
double_z: true
z_channels: 4
resolution: 256
in_channels: 3
out_ch: 3
ch: 128
ch_mult:
- 1
- 2
- 4
- 4
num_res_blocks: 2
attn_resolutions: []
dropout: 0.0
lossconfig:
target: torch.nn.Identity
cond_stage_config:
target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
params:
use_fp16: True
data:
target: main.DataModuleFromConfig
params:
batch_size: 64
wrap: False
train:
target: ldm.data.base.Txt2ImgIterableBaseDataset
params:
file_path: "/home/notebook/data/group/huangxin/laion-400m/e-commerce/e-commerce-0.tsv"
world_size: 1
rank: 0
lightning:
trainer:
accelerator: 'gpu'
devices: 4
log_gpu_memory: all
max_epochs: 2
precision: 16
auto_select_gpus: False
strategy:
target: pytorch_lightning.strategies.ColossalAIStrategy
params:
use_chunk: False
enable_distributed_storage: True,
placement_policy: cuda
force_outputs_fp32: False
log_every_n_steps: 2
logger: True
default_root_dir: "/tmp/diff_log/"
profiler: pytorch
logger_config:
wandb:
target: pytorch_lightning.loggers.WandbLogger
params:
name: nowname
save_dir: "/tmp/diff_log/"
offline: opt.debug
id: nowname
may be you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4
conda env create -f environment.yaml it give ResolvePackageNotFound:
can it run CPU?
may be you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4
ok, I am downloading it and trying it again.
conda env create -f environment.yaml it give ResolvePackageNotFound:
- cudatoolkit=11.3
- libgcc-ng[version='>=9.3.0']
- __glibc[version='>=2.17']
- cudatoolkit=11.3
- libstdcxx-ng[version='>=9.3.0']
can it run CPU?
I ran it in GPU environment.
may be you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4
@Fazziekey I have update the pretrained model and code, but encountered the same problem.
How do we comprehend the this problem:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
"What does “size 8” stand for ?
Hi,could you please share the detailed link of pretrained model? I only fould some *.ckpt models. @GxjGit
Hi,could you please share the detailed link of pretrained model? I only fould some *.ckpt models. @GxjGit Look at this, click the tab of "Files and versions"
And you can also download by cmd:
may be you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4
@Fazziekey I have update the pretrained model and code, but encountered the same problem.
How do we comprehend the this problem:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.
"What does “size 8” stand for ?
@Fazziekey I have soveled my problem. The reason is the incorrect images size. As the size of images in my dataset is different, it repoted "RuntimeError: stack expects each tensor to be equal size, but got [140, 140, 3] at entry 0 and [300, 500, 3] at entry 1", so I resize it in 224 x 224, it report the error as above. when I change to 256 x 256 as the yaml setting, it run successfully.
But I can not find resize operation in the origin code, Are the images in your dataset in the fixed size of 256*256?
I suggest that a description of the resolution requirements for input dataset images can be added. Anyway, Thanks a lot.
Thanks! @GxjGit I download and meet some problems:1.Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPTextModelZero, 2.like followings:
may be you should download the pretrained model from https://huggingface.co/CompVis/stable-diffusion-v1-4
@Fazziekey I have update the pretrained model and code, but encountered the same problem. How do we comprehend the this problem: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list. "What does “size 8” stand for ?
@Fazziekey I have soveled my problem. The reason is the incorrect images size. As the size of images in my dataset is different, it repoted "RuntimeError: stack expects each tensor to be equal size, but got [140, 140, 3] at entry 0 and [300, 500, 3] at entry 1", so I resize it in 224 x 224, it report the error as above. when I change to 256 x 256 as the yaml setting, it run successfully.
But I can not find resize operation in the origin code, Are the images in your dataset in the fixed size of 256*256?
I suggest that a description of the resolution requirements for input dataset images can be added. Anyway, Thanks a lot.
yes,the input image size must be 256*256 for Latency Diffusion
🐛 Describe the bug
I ran the example of diffusion according to https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion: steps: conda env create -f environment.yaml conda activate ldm pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org git clone https://github.com/Lightning-AI/lightning && cd lightning && git reset --hard b04a7aa pip install -r requirements.txt && pip install .
dataset: laion-400m
run: bash train.sh
failed info:
/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:438: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check. rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.") Traceback (most recent call last): File "/home/code/ColossalAI/examples/images/diffusion/main.py", line 811, in
trainer.fit(model, data)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, *kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, *kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 74, in optimizer_step
closure_result = closure()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in call
self._result = self.closure(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
step_output = self._step_fn()
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", kwargs.values())
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1440, in _call_strategy_hook
output = fn(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 352, in training_step
return self.model(*args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, *kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 241, in forward
outputs = self.module(args, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, kwargs)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
output = self._forward_module.training_step(*inputs, *kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 411, in training_step
loss, loss_dict = self.shared_step(batch)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 976, in shared_step
loss = self(x, c)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(input, kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 988, in forward
return self.p_losses(x, c, t, args, kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1122, in p_losses
model_output = self.apply_model(x_noisy, t, cond)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1094, in apply_model
x_recon = self.model(x_noisy, t, cond)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(input, kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 1519, in forward
out = self.diffusion_model(x, t, context=cc)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, *kwargs)
File "/home/code/ColossalAI/examples/images/diffusion/ldm/modules/diffusionmodules/openaimodel.py", line 927, in forward
h = th.cat([h, hs.pop()], dim=1)
File "/opt/conda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 170, in __torch_function__
ret = func(args, kwargs)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 7 for tensor number 1 in the list.**
Environment