shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

Audio Generation Support In Tensorboard #24

Closed CreativeSelf0 closed 1 year ago

CreativeSelf0 commented 1 year ago

I'm wondering if you already support audio synthesis every certain number of steps, as that would be helpful for monitoring model convergence.

lpscr commented 1 year ago

hi @shivammehta25 I just wanted to say a big thank you for your fantastic work and for sharing this amazing tool! Today I've been playing around with it, and I found that having audio in TensorBoard is incredibly helpful for tracking the model's progress. I'm not sure if this is the best way, but I'm sharing it in case someone wants to use it.

hi @CreativeSelf0 here is how you can do what you asked. I haven't tested it a lot, but it seems to work well. I'm not sure it's the best method, as I'm very new to this AI stuff.

First, open the file Matcha-TTS/matcha/models/baselightningmodule.py for editing.

log = utils.get_pylogger(__name__) https://github.com/shivammehta25/Matcha-TTS/blob/c8d0d60f87147fe340f4627b84588e812e5fbb00/matcha/models/baselightningmodule.py#L16

Add the following script at line 17; it is a helper to generate audio. Parts of this code can also be found in cli.py.

#---- code from cli.py ----
import os
from matcha.utils.utils import get_user_data_dir
from matcha.hifigan.config import v1
from matcha.hifigan.denoiser import Denoiser
from matcha.hifigan.env import AttrDict
from matcha.hifigan.models import Generator as HiFiGAN
import torch

def load_hifigan(checkpoint_path, device):
    h = AttrDict(v1)
    hifigan = HiFiGAN(h).to(device)
    hifigan.load_state_dict(torch.load(checkpoint_path, map_location=device)["generator"])
    _ = hifigan.eval()
    hifigan.remove_weight_norm()
    return hifigan

def load_vocoder(vocoder_name, checkpoint_path, device):
    print(f"[!] Loading {vocoder_name}!")
    vocoder = None
    if vocoder_name in ("hifigan_T2_v1", "hifigan_univ_v1"):
        vocoder = load_hifigan(checkpoint_path, device)
    else:
        raise NotImplementedError(
            f"Vocoder {vocoder_name} not implemented! define a load_<<vocoder_name>> method for it"
        )

    denoiser = Denoiser(vocoder, mode="zeros")
    print(f"[+] {vocoder_name} loaded!")
    return vocoder, denoiser

def to_waveform(mel, vocoder, denoiser=None):
    # vocode the mel spectrogram and optionally clean it up with the HiFi-GAN denoiser
    audio = vocoder(mel).clamp(-1, 1)
    if denoiser is not None:
        audio = denoiser(audio.squeeze(), strength=0.00025).cpu().squeeze()

    return audio.cpu().squeeze()

vocoder_name = "hifigan_T2_v1"  # you can also use "hifigan_univ_v1" here
# assumes the vocoder checkpoint has already been downloaded to the user data dir
# (the same place the matcha-tts CLI stores it)
checkpoint_path = get_user_data_dir()
file_vocoder = os.path.join(checkpoint_path, vocoder_name)
vocoder, denoiser = load_vocoder(vocoder_name, file_vocoder, "cuda")

def get_waveform(mel):
    # pass denoiser=denoiser here instead of None if you want the denoised output
    return to_waveform(mel, vocoder, denoiser=None)
#---- code from cli.py ----
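
If you want to sanity-check the helper outside of training, you can call it on a random mel of the right shape (80 mel bins for the default LJSpeech config). This check is just an illustration, not part of the original change:

# quick, illustrative sanity check of get_waveform (assumes the globals above are loaded)
dummy_mel = torch.randn(1, 80, 100, device="cuda")  # (batch, n_mels, frames)
wav = get_waveform(dummy_mel)
print(wav.shape)  # roughly frames * hop_length samples, i.e. about 25600 for hop 256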

Now find the method def on_validation_end(self) -> None: at line 167:

https://github.com/shivammehta25/Matcha-TTS/blob/c8d0d60f87147fe340f4627b84588e812e5fbb00/matcha/models/baselightningmodule.py#L206

Add this script at line 207 (right after the linked line), inside the synthesis loop, so that i and output are the loop variables already defined there:

                #---- add audio to tensorboard ----
                waveform = get_waveform(output["mel"])
                self.logger.experiment.add_audio(str(i), waveform, sample_rate=22050, global_step=self.current_epoch)
                #---- add audio to tensorboard ----

Now TensorBoard has a new Audio tab where you can listen to the first two validation samples, and you can drag the step slider to hear how your model sounds at each step.

(screenshot: the new Audio tab in TensorBoard)

shivammehta25 commented 1 year ago

Hello! @lpscr,

Thank you for the 🍵 super useful information! This would also be my recommended way of doing it if one wants to load HiFi-GAN. However, one downside of this approach is that HiFi-GAN will compete for GPU memory and bandwidth, especially because it is loaded at the global (module) level. If you need to save GPU memory, you could instead load it only when validation runs, which adds some overhead each time (a classic tradeoff).
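A minimal sketch of that alternative (my sketch, not code from the repo), reusing load_vocoder and to_waveform from the snippet above:

# Sketch (assumption, not repo code): load HiFi-GAN only while logging
# validation audio, then free it so it does not occupy GPU memory during training.
import os
import torch
from matcha.utils.utils import get_user_data_dir

def on_validation_end(self) -> None:
    vocoder, denoiser = load_vocoder(
        "hifigan_T2_v1",
        os.path.join(get_user_data_dir(), "hifigan_T2_v1"),
        self.device,
    )
    try:
        # ... run the existing synthesis loop here, calling
        # to_waveform(output["mel"], vocoder) and
        # self.logger.experiment.add_audio(...) exactly as in the snippet above ...
        pass
    finally:
        del vocoder, denoiser
        torch.cuda.empty_cache()  # give the vocoder's memory back before training resumes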

One other low-memory workaround would be to use Griffin-Lim to convert the mels to waveforms. The quality will suffer, but I think you can still get a high-level idea of how the prosody is evolving. Coqui has a nice implementation you can use:

https://github.com/coqui-ai/TTS/blob/6fef4f9067c0647258e0cd1d2998716565f59330/TTS/utils/audio/processor.py#L542
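
For example, a rough sketch of that fallback using librosa's Griffin-Lim (not from the repo; it assumes LJSpeech-style mel parameters of 22050 Hz, n_fft=1024, hop 256, fmin 0, fmax 8000, and that output["mel"] is a natural-log mel spectrogram, so adjust it to your data config):

import numpy as np
import librosa

def griffin_lim_waveform(mel):
    # mel: (1, n_mels, T) log-mel tensor, e.g. output["mel"]; undo the log first
    mel_np = np.exp(mel.squeeze().cpu().numpy())
    return librosa.feature.inverse.mel_to_audio(
        mel_np,
        sr=22050,
        n_fft=1024,
        hop_length=256,
        win_length=1024,
        power=1.0,   # magnitude (HiFi-GAN-style) mels, not power spectrograms
        fmin=0,
        fmax=8000,
        n_iter=32,   # Griffin-Lim iterations; more is slower but slightly cleaner
    )

This runs entirely on the CPU, so it costs no GPU memory at all.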

lpscr commented 1 year ago

@shivammehta25 thank you very much for the info, very helpful :)

shivammehta25 commented 1 year ago

I am closing the issue for now. If you have any further questions please feel free to reopen it and continue the discussion.