lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

video synthesis using elucidatedImagen #245

Open Feanor007 opened 1 year ago

Feanor007 commented 1 year ago

Has anyone successfully trained the model on video synthesis (video-to-video, no conditioning)? I have trained my model for 10K steps and still got pretty bad results.

I am currently only trying to generate a 64 × 64 × 32 video, with no upscaling involved. The data is auto-normalized and I got loss values around 0.1. Is this normal?

Here is the code for training:

from tqdm import tqdm
from imagen_pytorch import Unet3D, ElucidatedImagen, ImagenTrainer

unet1 = Unet3D(dim = 32, channels = 1, dim_mults = (1, 2, 4, 8)).cuda()
unet2 = Unet3D(dim = 32, channels = 1, dim_mults = (1, 2, 4, 8)).cuda()

imagen = ElucidatedImagen(
    condition_on_text = False,
    unets = (unet1, unet2),
    channels = 1,
    image_sizes = (64, 128),
    random_crop_sizes = (None, 16),
    num_sample_steps = 200,                     # number of sampling steps
    cond_drop_prob = 0.1,                       # probability of dropping conditioning (for classifier-free guidance)
    sigma_min = 0.002,                          # min noise level
    sigma_max = (80, 160),                      # max noise level
    sigma_data = 0.5,                           # standard deviation of data distribution
    rho = 7,                                    # controls the sampling schedule
    P_mean = -1.2,                              # mean of log-normal distribution from which noise is drawn for training
    P_std = 1.2,                                # standard deviation of log-normal distribution from which noise is drawn for training
    S_churn = 80,                               # parameters for stochastic sampling, depends on dataset
    S_tmin = 0.05,
    S_tmax = 50,
    S_noise = 1.003,
).cuda()

trainer = ImagenTrainer(
    imagen = imagen,
    split_valid_from_train = True  # whether to split a validation set off from the training dataset
).cuda()

# MyDataset is the author's custom video dataset (defined elsewhere), loading 32-frame clips at 64 x 64
dataset = MyDataset(folder = '/home/zeyu/Anime/img/', image_size = 64, frame = 32)
trainer.add_train_dataset(dataset, batch_size = 1)

for i in tqdm(range(200000)):
    loss = trainer.train_step(unet_number = 1, max_batch_size = 4)
    print(f'loss: {loss}')

    if not (i % 500):  # run a validation step every 500 iterations
        valid_loss = trainer.valid_step(unet_number = 1, max_batch_size = 4)
        print(f'valid loss: {valid_loss}')

Here is the code for inference:

# video_tensor_to_gif is assumed to come from imagen_pytorch.data; ck is defined elsewhere in the author's script
from imagen_pytorch.data import video_tensor_to_gif

videos = trainer.sample(batch_size = 1, init_images = dataset[1][0,:,:,:], stop_at_unet_number = 1, video_frames = 32, return_pil_images = True)
video_tensor_to_gif(videos[0][0,:,:,:,:], f'./output_imgs/sample-{ck}-wInit.gif')

Many thanks in advance!

pgarz commented 1 year ago

From the looks of it, you're not training the second U-Net. You have to make separate calls to trainer.train_step(unet_number = 2, max_batch_size = 4) to train the upscaling unet.
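
For example, here is a minimal sketch of how that might look, reusing the trainer and hyperparameters from the original post (the alternating schedule is just one option, not the only way to do it):

# sketch: train both unets by alternating one step for each per iteration
for i in tqdm(range(200000)):
    for unet_number in (1, 2):
        loss = trainer.train_step(unet_number = unet_number, max_batch_size = 4)
        print(f'unet {unet_number} loss: {loss}')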

Feanor007 commented 1 year ago

From the looks of it, you're not training the second U-Net. You have to make separate calls to trainer.train_step(unet_number = 2, max_batch_size = 4) to train the upscaling unet.

Thank you for your reply, but at this stage I only intend to train the first U-Net to get a 64 × 64 × 32 video. I did turn off the second U-Net during inference by setting stop_at_unet_number = 1.
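
For reference, a minimal sketch of that unconditional sampling call without an init image, assuming the trainer from the original post (the printed shape is an expectation based on the configuration above, not a verified output):

# sample only from the base unet; the second (128px) unet is skipped entirely
videos = trainer.sample(batch_size = 1, video_frames = 32, stop_at_unet_number = 1)
print(videos.shape)  # expected: (1, 1, 32, 64, 64), i.e. (batch, channels, frames, height, width)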