Open lukaszliniewicz opened 6 months ago
Is it happening at

```python
concat_sample = audio_tokenizer.decode(
    [(concat_frames, None)]  # [1,T,8] -> [1,8,T]
)
gen_sample = audio_tokenizer.decode(
    [(gen_frames, None)]
)
```
If so, that means the model generated too few tokens. This shouldn't happen with a well-trained model unless the prompt is extremely hard, e.g. inaudible speech or extreme noise. Plus, I already have code that prevents this from happening (search for `if cur_num_gen <= self.args.encodec_sr // 5` in ./models/voicecraft.py).

To verify whether the model only output a few tokens, print `gen_frames`; it should be `[1, 4, T]`, where T is n sec * 50, i.e. if you expect the output to be longer than 2 sec, T should be bigger than 100.
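As a quick sanity check, the shape condition above can be sketched like this (a minimal helper of my own, assuming the 50 frames/sec EnCodec rate mentioned above; the function name is hypothetical, not from the repo):

```python
def check_gen_frames(shape, expected_secs, encodec_sr=50):
    """Sanity-check a generated token tensor shape [1, n_codebooks, T].

    Returns True when T covers at least `expected_secs` of audio at
    EnCodec's frame rate (50 frames per second by default)."""
    batch, n_codebooks, T = shape
    min_frames = expected_secs * encodec_sr
    return batch == 1 and T >= min_frames

# A 2-second target needs T of at least 100 frames:
print(check_gen_frames((1, 4, 260), 2))  # True: ~5.2 s worth of tokens
print(check_gen_frames((1, 4, 10), 2))   # False: the model stopped too early
```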
Apologies, it was my own stupidity. I didn't realize that you need to include the transcript of the sample in the prompt! Some generations worked (when the wav sample used was very short, <1 s, which really confused me), and finally I noticed that in the notebook example you prepend the transcript to the prompt...
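For anyone else who trips over this: the idea is simply that the text given to the model must start with the transcript of the prompt audio, followed by the new text to synthesize. A rough sketch (placeholder strings and variable names are my own, not the repo's):

```python
# Transcript of the reference wav used as the voice prompt (must match the audio).
prompt_transcript = "This is what the speaker says in the prompt wav."
# New text you actually want synthesized in that voice.
target_text = "And this is the sentence I want generated."

# VoiceCraft continues from the prompt, so the model must see both parts.
target_transcript = prompt_transcript + " " + target_text
print(target_transcript)
```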
Anyhow, I got this working natively on Windows with some minor adjustments to path handling in several audiocraft files. I'll do some more testing, but eventually I'd like to include VoiceCraft as an option in my audiobook generator app (https://github.com/lukaszliniewicz/Pandrator), which is non-commercial and open source, as I received a request for it.
I'm getting this error regardless of the wav file I use, including the demo file:
```
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size
```
Have you encountered this before?
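For context on what that error means: PyTorch raises it when the (padded) input reaching a Conv1d layer is shorter than the layer's kernel, i.e. the encoder was fed almost no samples, which usually points to a loading or resampling problem rather than the model itself. The length arithmetic behind the check is the standard convolution formula (this is a generic sketch, not VoiceCraft code):

```python
def conv1d_out_len(L_in, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (the standard formula PyTorch uses)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (L_in + 2 * padding - effective_kernel) // stride + 1

# The error above: padded length 6 vs. kernel 7 leaves no valid output position,
# so PyTorch raises instead of producing an empty output.
print(conv1d_out_len(6, 7))    # 0
print(conv1d_out_len(100, 7))  # 94
```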