happylittlecat2333 / Auffusion

Official codes and models of the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"
https://auffusion.github.io/

Can I control the duration of the Text-guided style transfer's output audio? #10

Open hello-xiaow opened 7 months ago

hello-xiaow commented 7 months ago

I tested and found that the duration of the output audio is always 10 seconds. How can I modify the code so that the output audio duration matches the input audio duration?

happylittlecat2333 commented 7 months ago

This is because we first pad the spectrogram (norm_spec) to a width of 1024 frames before sending it to StableDiffusionImg2ImgPipeline, and we keep only the original width of the spectrogram in the output. The process looks like this:

audio, sampling_rate = load_wav(audio_path)
audio, spec = get_mel_spectrogram_from_audio(audio)
norm_spec = normalize_spectrogram(spec)
norm_spec = norm_spec[:, :, width_start:width_start + width]  # crop the input spectrogram to `width` frames
norm_spec = pad_spec(norm_spec, 1024)                         # pad to the fixed width the pipeline expects

.....

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, image=norm_spec, num_inference_steps=100, generator=generator, output_type="pt", strength=strength, guidance_scale=7.5
    ).images[0]

# crop the generated spectrogram back to `width` frames before adding it to image_list
output_spec = output_spec[:, :, :width]
....
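For reference, the fixed ~10 second output comes directly from the 1024-frame padding: each spectrogram frame covers hop_length / sampling_rate seconds of audio. A minimal sketch of that relationship (the hop length and sampling rate below are placeholder assumptions, use the values from the repo's mel configuration):

# Sketch: how spectrogram width maps to audio duration.
# hop_length and sampling_rate are assumed values; check the repo's mel config.
hop_length = 160        # hop size in samples (assumption)
sampling_rate = 16000   # sampling rate in Hz (assumption)

def frames_to_seconds(num_frames: int) -> float:
    # duration in seconds covered by `num_frames` spectrogram frames
    return num_frames * hop_length / sampling_rate

def seconds_to_frames(duration_s: float) -> int:
    # number of frames needed to cover `duration_s` seconds of audio
    return int(round(duration_s * sampling_rate / hop_length))

print(frames_to_seconds(1024))  # ~10.24 s with the assumed settings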

Hence, there are two alternatives. The first is to crop the output back to the original width of the spectrogram, as we do. The other option is to try not padding the spectrogram before sending it to the pipeline, although I have not tried that.
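For the first alternative, here is a minimal, unverified sketch that reuses the variables from the snippet above (norm_spec, pipe, prompt, strength, generator) and crops the generated spectrogram back to the input's own width, so the decoded audio has (roughly) the same duration as the input:

import torch

# width of the un-padded input spectrogram, in frames
input_width = norm_spec.shape[-1]
padded_spec = pad_spec(norm_spec, 1024)   # pad to the fixed width the pipeline expects

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt, image=padded_spec, num_inference_steps=100,
        generator=generator, output_type="pt", strength=strength, guidance_scale=7.5
    ).images[0]

# keep only as many frames as the input had, so the vocoded audio
# matches the input duration
output_spec = output_spec[:, :, :input_width]

After this crop, the remaining denormalization and vocoder steps can be applied as in the repo's existing code.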