openai / jukebox

Code for the paper "Jukebox: A Generative Model for Music"
https://openai.com/blog/jukebox/

AssertionError: torch.Size([1, 6528, 2048]) != (1, 6321, 2048) nor (1, 1, 2048). Did you pass the correct --sample_length? #259

Closed: ElizavetaSedova closed this issue 2 years ago

ElizavetaSedova commented 2 years ago

Hello, please help! I want to fine-tune the 1b_model on new audio files of the kind it currently fails to generate, so that I can get good samples for similar audio later. All my audio files are 44100 Hz stereo. I'm not sure exactly which parameters are needed for training. I am following the instructions in "Fine-tune pre-trained top-level prior to new style(s)". When I run:

mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,prior_1b_lyrics,all_fp16,cpu_ema --name=finetuned \
--sample_length=1048576 --bs=1 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I get this error:

assert hps.sample_length / hps.sr < self.min_duration, f'Sample length {hps.sample_length} per sr {hps.sr} ({hps.sample_length / hps.sr:.2f}) should be shorter than min duration {self.min_duration}'
AssertionError: Sample length 1048576 per sr 44100 (23.78) should be shorter than min duration 17.84
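
The arithmetic behind this assertion can be checked outside of training. The sketch below only mirrors the comparison quoted in the message (sample_length / sr < min_duration), using the sample rate and minimum duration printed in the error. The helper names are hypothetical, and the second candidate value (786432) is an illustrative choice, not a value taken from this thread.

```python
# Values copied from the assertion message above.
SAMPLE_RATE = 44100
MIN_DURATION = 17.84


def duration_seconds(sample_length, sr=SAMPLE_RATE):
    """Length in seconds of a training window of `sample_length` raw samples."""
    return sample_length / sr


def passes_duration_check(sample_length, sr=SAMPLE_RATE, min_duration=MIN_DURATION):
    """Mirror of the quoted assertion: the window must be shorter than min_duration."""
    return duration_seconds(sample_length, sr) < min_duration


# 1048576 is the value from the command above; 786432 is only an example candidate.
for candidate in (1048576, 786432):
    print(candidate, f"{duration_seconds(candidate):.2f}", passes_duration_check(candidate))
# prints:
# 1048576 23.78 False
# 786432 17.83 True
```

Passing this check is necessary but not sufficient: the chosen sample_length also has to match the context length the prior was built for, which is what the next error is about.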

As I understand it, I need to decrease sample_length. But when I decrease it, a different error appears:

assert x_cond.shape == (N, D, self.width) or x_cond.shape == (N, 1, self.width), f"{x_cond.shape} != {(N, D, self.width)} nor {(N, 1, self.width)}. Did you pass the correct --sample_length?"
AssertionError: torch.Size([1, 6528, 2048]) != (1, 6321, 2048) nor (1, 1, 2048). Did you pass the correct --sample_length?
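
One way to read the two mismatched lengths is to work backwards from them. The sketch below is a guess under stated assumptions, none of which are confirmed in this thread: that the top-level VQ-VAE compresses raw audio by a factor of 128, that prior_1b_lyrics prepends 384 lyric tokens, and that its music context is 6144 tokens. Check these against the hyperparameters in your own checkout before relying on them. All names in the code are hypothetical.

```python
# Assumptions, NOT taken from this thread; verify against your checkout.
RAW_TO_TOKENS = 128    # assumed top-level compression (raw samples per token)
N_LYRIC_TOKENS = 384   # assumed number of lyric tokens prepended to the sequence
N_CTX = 6144           # assumed music-token context of the top-level prior


def sequence_length(sample_length, raw_to_tokens=RAW_TO_TOKENS,
                    n_lyric_tokens=N_LYRIC_TOKENS):
    """Lyric tokens plus music tokens for a window of `sample_length` samples."""
    return sample_length // raw_to_tokens + n_lyric_tokens


def sample_length_for(seq_len, raw_to_tokens=RAW_TO_TOKENS,
                      n_lyric_tokens=N_LYRIC_TOKENS):
    """Smallest sample_length whose window yields a sequence of `seq_len` tokens."""
    return (seq_len - n_lyric_tokens) * raw_to_tokens


print(N_CTX + N_LYRIC_TOKENS)    # 6528, the first length in the error
print(sample_length_for(6321))   # 759936, a window that would give the second length
print(sample_length_for(6528))   # 786432, a window that would give the first length
print(sequence_length(786432))   # 6528
```

If these assumptions hold, the conditioning tensor is sized for the model's fixed context (6528) while the data window supplies only 6321 positions, so a sample_length of 786432 (about 17.83 s at 44100 Hz, just under the 17.84 s minimum) would make the two agree. Treat this as a hypothesis to test, not as the fix reported by the original poster.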

I don't understand what I need to do to make everything work without errors...

ElizavetaSedova commented 2 years ago

I set limits on the minimum and maximum duration, and this solved the problem.

Theehawau commented 1 year ago

I have the same error, but putting limits on the minimum and maximum duration isn't solving the problem. Can you give more insight into the values you used and how you chose them?