guyyariv / TempoTokens

This repo contains the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/
MIT License

Error when training on the landscape dataset #6

Open jiajiaxiaoskx opened 6 months ago

jiajiaxiaoskx commented 6 months ago

When I run the training code on the landscape dataset, I encounter an error. How should I solve it?

LoRA rank 16 is too large. setting to: 4
Traceback (most recent call last):
  File "train.py", line 1221, in <module>
    main(config)
  File "train.py", line 770, in main
    unet_lora_params, unet_negation = inject_lora(
  File "train.py", line 293, in inject_lora
    params, negation = injector(injector_args)
  File "/home/TempoTokens/utils/lora.py", line 461, in inject_trainable_lora_extended
    _tmp.to(_child_module.bias.device).to(_child_module.bias.dtype)
AttributeError: 'NoneType' object has no attribute 'device'
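[Editor's note: the crash comes from inject_trainable_lora_extended dereferencing _child_module.bias on a module created with bias=False. A minimal sketch of the usual guard, written against the line from the traceback (an assumption about a fix, not the authors' code):]

```python
# Sketch of an assumed fix, not the repo's: only touch .bias when it
# exists; otherwise fall back to the weight's device and dtype.
if _child_module.bias is not None:
    _tmp.to(_child_module.bias.device).to(_child_module.bias.dtype)
else:
    _tmp.to(_child_module.weight.device).to(_child_module.weight.dtype)
```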

Thank you for your answer!

guyyariv commented 6 months ago

Hi, our pre-trained models were not trained with LoRA, so I have not encountered this error. Try using the config file to disable LoRA during training (i.e., train the adapter only).

jiajiaxiaoskx commented 6 months ago

Thank you! I still don't quite understand the adapter you mentioned. For the landscape and audioset-drum datasets, would you mind telling me which training modules should be set to True in the config file?

guyyariv commented 6 months ago

Be sure to set use_unet_lora to False in the config file to disable LoRA training: use_unet_lora: False. However, the adapter is required for training, so you cannot disable it from the config file. Good luck! :)
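[Editor's note: for anyone hitting the same error, a sketch of flipping that flag programmatically, assuming OmegaConf-style YAML configs; the config path and flat key layout are assumptions, not the repo's exact schema:]

```python
# Hedged sketch: disable UNet LoRA training in the YAML config.
# "configs/landscape.yaml" is a hypothetical path.
from omegaconf import OmegaConf

config = OmegaConf.load("configs/landscape.yaml")
config.use_unet_lora = False  # train the adapter only, per the reply above
OmegaConf.save(config, "configs/landscape.yaml")
```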

jiajiaxiaoskx commented 6 months ago

Thank you for your patient reply and excellent work! I encountered several errors while running your training code, including issues with default parameter settings and dataset input. I'm not sure whether that's because I don't fully understand the code framework or because there are some issues with the code. For example, this error:

File "train.py", line 1056, in main for step, batch in enumerate(train_dataloader): File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/accelerate/data_loader.py", line 384, in iter current_batch = next(dataloader_iter) File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in next data = self._next_data() File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/TempoTokens/utils/dataset.py", line 666, in getitem index = random.choice(list(self.valid_videos)) File "/home/anaconda3/envs/protagonist-113/lib/python3.8/random.py", line 290, in choice raise IndexError('Cannot choose from an empty sequence') from None IndexError: Cannot choose from an empty sequence

Could you possibly provide a final version of the code you used for training on the audioset-drum or landscape datasets (including the config file)? I would be very grateful! If it's not convenient, I can communicate with you via email. Thank you!

guyyariv commented 6 months ago

It looks like you didn't load the datasets as required; they should be split into an audio folder and a video folder. It appears you tried to load them from an empty sequence. The current code should run without any issues. Feel free to reach out to me via email at guyyariv.mail at gmail dot com
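[Editor's note: a minimal sketch of the folder split described above, assuming a flat source directory of media files; the folder names and extensions are illustrative, not the repo's exact layout:]

```python
# Illustrative split of a mixed dataset folder into the audio/ and
# video/ layout the loader expects; all paths are placeholders.
from pathlib import Path
import shutil

src = Path("landscape_raw")
audio_dir = Path("landscape/audio")
video_dir = Path("landscape/video")
audio_dir.mkdir(parents=True, exist_ok=True)
video_dir.mkdir(parents=True, exist_ok=True)

for f in src.iterdir():
    if f.suffix.lower() in {".wav", ".mp3", ".flac"}:
        shutil.copy2(f, audio_dir / f.name)
    elif f.suffix.lower() in {".mp4", ".avi", ".mov"}:
        shutil.copy2(f, video_dir / f.name)
```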

jiajiaxiaoskx commented 6 months ago

Following your suggestions, I attempted to replicate the process and conducted experiments on the Landscape and Audioset-Drum datasets (I did not change the provided config files). However, my results have been less than satisfactory. Below are the changes I made to the code:

1. The audio data is stereo, so the input dimension is [2, 16000]. I noticed that in your dataset code the audio input dimension is set to [1, 16000], so I took a simple average over the channel dimension (sketched below).
2. I imported randn_tensor from diffusers.utils.torch_utils.

After these changes I was able to complete training successfully, but the validation results were very poor: the outputs were significantly inferior to the results showcased on your project page and to those generated with your provided pretrained model. I'm not sure whether I made a mistake somewhere, whether you employed additional strategies during training, or whether I should change the parameters in the config file. I would like to seek your advice on this matter.
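[Editor's note: a minimal sketch of the channel averaging described in point 1, assuming torchaudio loading; the file path is a placeholder, and keepdim=True preserves the [1, 16000] shape the dataset code expects:]

```python
# Assumed reconstruction of the stereo-to-mono change, not the exact edit.
import torchaudio

waveform, sample_rate = torchaudio.load("clip.wav")  # stereo: [2, 16000]
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)    # mono: [1, 16000]
```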

guyyariv commented 6 months ago

Hi, I'm not sure why you cannot reproduce the Landscape and Audioset-Drum results. These are both easy datasets (less challenging than VGGSound, for example), and the model should converge quickly and to high quality on them. I used the provided versions of those datasets (as mentioned in the README; for example, https://drive.google.com/drive/folders/14A1zaQI5EfShlv3QirgCGeNFzZBzQ3lq is Landscape) and split them into video alone and audio alone (mono), then used the provided config file. Please ask ChatGPT to split them for you into two different folders. Then try training again and let me know if it improves.
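[Editor's note: one way to produce the mono audio half of that split is with ffmpeg; a sketch follows, where the paths, WAV output, and 16 kHz rate are assumptions rather than the repo's documented preprocessing:]

```python
# Illustrative ffmpeg extraction of a mono audio track per clip:
# -vn drops the video stream, -ac 1 downmixes to one channel,
# -ar sets the sample rate.
import subprocess
from pathlib import Path

audio_dir = Path("landscape/audio")
audio_dir.mkdir(parents=True, exist_ok=True)

for video in Path("landscape/video").glob("*.mp4"):
    out = audio_dir / (video.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1",
         "-ar", "16000", str(out)],
        check=True,
    )
```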

jiajiaxiaoskx commented 6 months ago

Hello, thank you for your patient reply! I still have a few questions regarding the code implementation that I would like to confirm with you:

1. The original video sizes of the three datasets differ (landscape is 288x512, audioset-drum is 96x96, vggsound is 360x212), yet all three config files set the training and inference video size to 384x384 and use a bucketing strategy. Should I change this parameter, or follow your setting and standardize everything to 384x384 (see the sketch below)?
2. The audio data for these datasets is stereo. When converting it to mono, should I average the two channels, or use only the first (or second) channel?
3. Will changing the batch size during training affect the results (can I change it to 2)?
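[Editor's note: on question 1, a sketch of one plausible way to standardize frames to 384x384 with a resize-then-center-crop; the transform choice is an assumption, not the repo's pipeline:]

```python
# Illustrative only: scale the shorter side to 384, then crop a square.
import torchvision.transforms as T

frame_transform = T.Compose([
    T.Resize(384),      # shorter side -> 384, aspect ratio preserved
    T.CenterCrop(384),  # final 384x384 square crop
])
# usage: frame_transform(pil_frame) on each decoded video frame
```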

I look forward to your reply!