happylittlecat2333 / Auffusion

Official codes and models of the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"
https://auffusion.github.io/

Can I control the duration of the Text-guided style transfer's output audio? #10

Open hello-xiaow opened 7 months ago

hello-xiaow commented 7 months ago

I tested and found that the duration of the output audio is always 10 seconds. How can I modify the code so that the output audio duration matches the input audio duration?

happylittlecat2333 commented 7 months ago

This is because we first pad the spectrogram (norm_spec) to a width of 1024 frames before sending it to StableDiffusionImg2ImgPipeline, and we keep only the original width of the spectrogram in the output. The process looks like this:

audio, sampling_rate = load_wav(audio_path)
audio, spec = get_mel_spectrogram_from_audio(audio)
norm_spec = normalize_spectrogram(spec)
norm_spec = norm_spec[:, :, width_start:width_start + width]  # crop the input spectrogram to `width` frames
norm_spec = pad_spec(norm_spec, 1024)                         # pad to the fixed width the pipeline expects

.....

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, image=norm_spec, num_inference_steps=100, generator=generator, output_type="pt", strength=strength, guidance_scale=7.5
    ).images[0]

# crop the generated spectrogram back to `width` frames before adding it to image_list
output_spec = output_spec[:, :, :width]
....
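For reference, the fixed ~10 second output comes directly from the 1024-frame padding: each spectrogram frame covers hop_length / sampling_rate seconds of audio. A minimal sketch of that relationship (the hop length and sampling rate below are placeholder assumptions, use the values from the repo's mel configuration):

# Sketch: how spectrogram width maps to audio duration.
# hop_length and sampling_rate are assumed values; check the repo's mel config.
hop_length = 160        # hop size in samples (assumption)
sampling_rate = 16000   # sampling rate in Hz (assumption)

def frames_to_seconds(num_frames: int) -> float:
    # duration in seconds covered by `num_frames` spectrogram frames
    return num_frames * hop_length / sampling_rate

def seconds_to_frames(duration_s: float) -> int:
    # number of frames needed to cover `duration_s` seconds of audio
    return int(round(duration_s * sampling_rate / hop_length))

print(frames_to_seconds(1024))  # ~10.24 s with the assumed settings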

Hence, there are two alternatives. The first is to crop the output back to the original width of the spectrogram, as we do. The other option is to try not padding the spectrogram before sending it to the pipeline, although I have not tried that.
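For the first alternative, here is a minimal, unverified sketch that reuses the variables from the snippet above (norm_spec, pipe, prompt, strength, generator) and crops the generated spectrogram back to the input's own width, so the decoded audio has (roughly) the same duration as the input:

import torch

# width of the un-padded input spectrogram, in frames
input_width = norm_spec.shape[-1]
padded_spec = pad_spec(norm_spec, 1024)   # pad to the fixed width the pipeline expects

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, image=padded_spec, num_inference_steps=100,
        generator=generator, output_type="pt", strength=strength, guidance_scale=7.5
    ).images[0]

# keep only as many frames as the input had, so the vocoded audio
# matches the input duration
output_spec = output_spec[:, :, :input_width]

After this crop, the remaining denormalization and vocoder steps can be applied as in the repo's existing code.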