Open lukaszliniewicz opened 6 months ago
Is it happening at

```python
concat_sample = audio_tokenizer.decode(
    [(concat_frames, None)]  # [1,T,8] -> [1,8,T]
)
gen_sample = audio_tokenizer.decode(
    [(gen_frames, None)]
)
```
If so, that means the model generated too few tokens. This shouldn't happen with a well-trained model unless the prompt is extremely hard, e.g. inaudible speech or extreme noise. Plus, I already have code that prevents this from happening (search for `if cur_num_gen <= self.args.encodec_sr // 5` in ./models/voicecraft.py).

To verify whether the model only output a few tokens, print `gen_frames`; it should be `[1, 4, T]`, where T is n sec * 50, i.e. if you expect the output to be longer than 2 sec, T should be bigger than 100.
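As a quick sanity check, the shape condition above can be sketched like this (a minimal helper of my own, assuming the 50 frames/sec EnCodec rate mentioned above; the function name is hypothetical, not from the repo):

```python
def check_gen_frames(shape, expected_secs, encodec_sr=50):
    """Sanity-check a generated token tensor shape [1, n_codebooks, T].

    Returns True when T covers at least `expected_secs` of audio at
    EnCodec's frame rate (50 frames per second by default)."""
    batch, n_codebooks, T = shape
    min_frames = expected_secs * encodec_sr
    return batch == 1 and T >= min_frames

# A 2-second target needs T of at least 100 frames:
print(check_gen_frames((1, 4, 260), 2))  # True: ~5.2 s worth of tokens
print(check_gen_frames((1, 4, 10), 2))   # False: the model stopped too early
```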
Apologies, it was my own stupidity. I didn't realize that you need to include the transcript of the sample in the prompt! Some generations worked (when the wav sample used was very short, <1 s, which really confused me), and finally I noticed that in the notebook example you prepend the transcript to the prompt...
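For anyone else who trips over this: the idea is simply that the text given to the model must start with the transcript of the prompt audio, followed by the new text to synthesize. A rough sketch (placeholder strings and variable names are my own, not the repo's):

```python
# Transcript of the reference wav used as the voice prompt (must match the audio).
prompt_transcript = "This is what the speaker says in the prompt wav."
# New text you actually want synthesized in that voice.
target_text = "And this is the sentence I want generated."

# VoiceCraft continues from the prompt, so the model must see both parts.
target_transcript = prompt_transcript + " " + target_text
print(target_transcript)
```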
Anyhow, I got this working natively on Windows with some minor adjustments to path handling in several audiocraft files. I'll do some more testing, but eventually I'd like to include VoiceCraft as an option in my audiobook generator app (https://github.com/lukaszliniewicz/Pandrator), which is non-commercial and open source, as I received a request for it.
I'm getting this error regardless of the wav file I use, including the demo file:
```
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size
```
Have you encountered this before?
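For context on what that error means: PyTorch raises it when the (padded) input reaching a Conv1d layer is shorter than the layer's kernel, i.e. the encoder was fed almost no samples, which usually points to a loading or resampling problem rather than the model itself. The length arithmetic behind the check is the standard convolution formula (this is a generic sketch, not VoiceCraft code):

```python
def conv1d_out_len(L_in, kernel, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (the standard formula PyTorch uses)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (L_in + 2 * padding - effective_kernel) // stride + 1

# The error above: padded length 6 vs. kernel 7 leaves no valid output position,
# so PyTorch raises instead of producing an empty output.
print(conv1d_out_len(6, 7))    # 0
print(conv1d_out_len(100, 7))  # 94
```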