lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

Unconditional training is much slower than the same network from denoising-diffusion-pytorch #261

Open adelacvg opened 2 years ago

adelacvg commented 2 years ago

I am training the unconditional version of Imagen, which I believe is equivalent to continuous-time Gaussian diffusion. However, I found that training the unconditional Imagen is much slower than ContinuousTimeGaussianDiffusion from denoising-diffusion-pytorch. Both are trained on the same dataset and devices, yet the Imagen version is 5-10 times slower, and for the same number of training steps it gives worse results than ContinuousTimeGaussianDiffusion. I would like it to produce comparable results in the same amount of time. How should I configure the network correctly?

# denoising-diffusion-pytorch script
from denoising_diffusion_pytorch import Unet, ContinuousTimeGaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8),
    random_fourier_features = True    # random Fourier features for the continuous time conditioning
).cuda()

diffusion = ContinuousTimeGaussianDiffusion(
    model,
    image_size=64,
    num_sample_steps=500,
    loss_type = 'l1',
).cuda()

trainer = Trainer(
    diffusion,
    'FFHQ/lq',
    train_batch_size = 32,
    train_lr = 8e-5,
    train_num_steps = 700000,         # total training steps
    gradient_accumulate_every = 2,    # gradient accumulation steps
    ema_decay = 0.995,                # exponential moving average decay
    amp = False,                      # mixed precision (left off here)
    results_folder='results_continue'
)

trainer.train()

# imagen-pytorch script
from imagen_pytorch import BaseUnet64, SRUnet256, Imagen, ImagenTrainer
from imagen_pytorch.data import Dataset
from tqdm import tqdm

# base unet for imagen
unet1 = BaseUnet64(
    dim = 64,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 3,
    layer_attns = (False, False, False, True),
    layer_cross_attns = (False, False, False, True),
    attn_heads = 4,
    ff_mult = 2.,
    memory_efficient = False
)

unet2 = SRUnet256()

# imagen, which contains the unets above (base unet and super-resolution unet)

imagen = Imagen(
    condition_on_text = False,   # this must be set to False for unconditional Imagen
    unets = (unet1, unet2),
    image_sizes = (64, 256),
    timesteps = 250
)

trainer = ImagenTrainer(
        imagen,
        split_valid_from_train = True,
).cuda()
dataset = Dataset('FFHQ/lq', image_size = 256)

trainer.add_train_dataset(dataset, batch_size = 32)

i = 0
with tqdm(initial = 0, total = 200000, disable = not trainer.is_main) as pbar:
    while i < 200000:
        loss = trainer.train_step(unet_number = 1, max_batch_size = 4)
        pbar.set_description(f'loss: {loss:.4f}')

        if not (i % 500):
            valid_loss = trainer.valid_step(unet_number = 1, max_batch_size = 4)
            print(f'valid loss: {valid_loss}')

        if not (i % 1000) and trainer.is_main: # is_main ensures sampling only runs on the main process in distributed training
            images = trainer.sample(batch_size = 25, stop_at_unet_number = 1, return_pil_images = True) # returns List[Image]
            image_grid(images, 5, 5).save(f'./SR_results/sample-{i // 1000}.png')    # image_grid is a user-defined helper

        if not (i % 1000):
            trainer.save(f'./SR_results/model_{i}.pt')

        i += 1
        pbar.update(1)
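
One quick check worth doing (a sketch, not part of the original post): compare the raw size of the two denoisers, since a much larger base unet by itself would explain a good part of the per-step slowdown. The `imagen.unets[0]` access assumes Imagen keeps its instantiated unets in a module list, so treat it as illustrative.

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# in the denoising-diffusion-pytorch script
print('ddpm unet params:', count_params(model))

# in the imagen-pytorch script (assumes Imagen exposes its unets as imagen.unets)
print('imagen base unet params:', count_params(imagen.unets[0]))
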
lucidrains commented 2 years ago

that's interesting, I'm not sure

There is a subtle difference in the resnet blocks. I'm using the GLIDE-style architecture here: norm, activation, then project.

However, the original ddpm does project, norm, then activation.
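
To make the difference concrete, here is a minimal sketch of the two orderings (illustrative class names, not the exact blocks in either repo):

import torch.nn as nn

class GlideStyleBlock(nn.Module):
    # imagen-pytorch ordering: norm -> activation -> project
    def __init__(self, dim, dim_out, groups = 8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, dim)
        self.act = nn.SiLU()
        self.proj = nn.Conv2d(dim, dim_out, 3, padding = 1)

    def forward(self, x):
        return self.proj(self.act(self.norm(x)))

class DdpmStyleBlock(nn.Module):
    # denoising-diffusion-pytorch ordering: project -> norm -> activation
    def __init__(self, dim, dim_out, groups = 8):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim_out, 3, padding = 1)
        self.norm = nn.GroupNorm(groups, dim_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.norm(self.proj(x)))
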

I've also added weight standardization in the ddpm-pytorch repo, since it reportedly works well with group norm, so it could be that too.
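
For reference, a hedged sketch of what weight standardization on a conv layer looks like (denoising-diffusion-pytorch has its own implementation; this only illustrates the idea):

import torch.nn as nn
import torch.nn.functional as F

class WeightStandardizedConv2d(nn.Conv2d):
    # standardize each output filter's weights to zero mean / unit variance before convolving
    def forward(self, x):
        eps = 1e-5
        weight = self.weight
        mean = weight.mean(dim = (1, 2, 3), keepdim = True)
        var = weight.var(dim = (1, 2, 3), keepdim = True, unbiased = False)
        weight = (weight - mean) / (var + eps).sqrt()
        return F.conv2d(x, weight, self.bias, self.stride, self.padding, self.dilation, self.groups)
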

santisy commented 1 year ago

@lucidrains @adelacvg Could this be because, in the referenced DDPM implementation, the UNet has attention layers at every level by default, while in this repo you have to specify in the arguments which layers should have attention?
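
If that is the suspicion, one illustrative (untested) way to bring the imagen base unet closer to the ddpm default is to enable self-attention on more levels, e.g.:

unet1 = BaseUnet64(
    dim = 64,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True, True),        # attention on more resolutions, closer to the ddpm default
    layer_cross_attns = (False, False, False, True),
    attn_heads = 4,
    ff_mult = 2.,
    memory_efficient = False
)
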

santisy commented 1 year ago

@lucidrains Have you evaluated how these two repos perform on unconditional generation tasks? I appreciate your contributions very much and hope to get more hints on this.