mlvlab / DDMI

Official Implementation (PyTorch) of "DDMI: Domain-Agnostic Latent Diffusion Models for Synthesizing High-Quality Implicit Neural Representations", ICLR 2024
MIT License

Issues running the code on Sky_timelapse Dataset #5

Closed: skrya closed this issue 4 months ago

skrya commented 4 months ago

Dear authors,

Thank you for providing this wonderful code base. I tried running the code to generate videos using the sky_timelapse dataset, but I ran into the error shown after the command.

I am running the first stage with the following command:

CUDA_VISIBLE_DEVICES=0,1 accelerate launch --multi_gpu --num_processes=2 main.py --exp d2c-vae --configs configs/d2c-vae/skytimelapse_gan.yaml

/home/sudhir/anaconda3/envs/vid/lib/python3.8/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv2d(input, weight, bias, self.stride,
0%| | 0/600 00:01<?, ?it/s
Traceback (most recent call last):
rank1: File "main.py", line 65, in
rank1: File "main.py", line 27, in main
rank1: File "/home/sudhir/Projects/Adobe/DDMI/exp/stage.py", line 151, in first_stage_train
rank1: File "/home/sudhir/Projects/Adobe/DDMI/tools/d2c_vae/video.py", line 212, in train
rank1: inputs_2d = torch.gather(x, 2, frame_idx_selected).squeeze(2)
rank1: RuntimeError: Size does not match at dimension 1 expected index [1, 64, 1, 256, 256] to be smaller than self [1, 3, 16, 256, 256] apart from dimension 2

Could you please advise on how I can fix this issue and proceed forward?

Additionally, could you let me know the training time for the sky_timelapse dataset and the number and specifications (GB) of the GPUs used?

Thanks!

DogyunPark commented 4 months ago

Thanks for your interest in our work. I have fixed the dimension error, and it should work properly now. If you have additional problems, please let me know.

Regarding the training time and resources, the first-stage training took up to 3 to 4 days for sky_timelapse on 4 A100 GPUs (80 GB each).
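For readers who hit the same `torch.gather` error, the shape rule behind it can be checked without a GPU. The sketch below is a pure-Python mirror of the rule documented for `torch.gather(input, dim, index)`: the index must have the same number of dimensions as the input and must be no larger than the input in every dimension except `dim`. The helper name `gather_output_shape` is hypothetical, and the "corrected" index shape is only an assumption about the intent (one frame selected for all 3 channels). It is not necessarily the change that was committed to the repository.

```python
from typing import Sequence, Tuple


def gather_output_shape(input_shape: Sequence[int],
                        index_shape: Sequence[int],
                        dim: int) -> Tuple[int, ...]:
    """Mirror the documented shape rule of torch.gather(input, dim, index).

    The index needs the same number of dimensions as the input, and
    index.size(d) <= input.size(d) for every d != dim. The output has
    the same shape as the index.
    """
    if len(input_shape) != len(index_shape):
        raise ValueError("index and input must have the same number of dimensions")
    for d, (n_in, n_idx) in enumerate(zip(input_shape, index_shape)):
        if d != dim and n_idx > n_in:
            raise ValueError(
                f"size mismatch at dimension {d}: index has {n_idx}, input has {n_in}"
            )
    return tuple(index_shape)


# Shapes taken from the traceback; x is read as (batch, channels, frames, H, W).
x_shape = (1, 3, 16, 256, 256)
bad_index_shape = (1, 64, 1, 256, 256)  # 64 > 3 at dimension 1

try:
    gather_output_shape(x_shape, bad_index_shape, dim=2)
    bad_index_rejected = False
except ValueError as err:
    bad_index_rejected = True
    print(err)

# Assumption: the goal is to pick one frame for all channels, so the index
# copies x's channel count instead of using 64.
good_index_shape = (x_shape[0], x_shape[1], 1, x_shape[3], x_shape[4])
out_shape = gather_output_shape(x_shape, good_index_shape, dim=2)
print(out_shape)  # (1, 3, 1, 256, 256); squeeze(2) would then give (1, 3, 256, 256)
```

In PyTorch itself, one plausible way to build such an index from a per-sample frame index is to reshape it to `(B, 1, 1, 1, 1)` and call `.expand(-1, C, 1, H, W)` with `C = x.size(1)`. Whether the mismatch came from the index or from the channel count of `x` depends on the config, so this is a starting point for debugging rather than a confirmed fix.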

skrya commented 4 months ago

Thanks for the response. I will take a look at it. There was also a bug in the sky_timelapse evaluation that runs during training: the eval code was missing some lines needed for video. I hope that was fixed as well?

DogyunPark commented 4 months ago

Thank you for letting me know. We have fixed several video-related issues.