shivammehta25 / Matcha-TTS

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
https://shivammehta25.github.io/Matcha-TTS/
MIT License

the motivation for inserting blank IDs between the input IPA-ids? #94


dbkest commented 3 months ago

Hello, could you please help me understand the motivation for inserting blank IDs between the input IPA IDs? The implementation can be found in text_mel_datamodule.py, line 216:

```python
def get_text(self, text, add_blank=True):
    text_norm, cleaned_text = text_to_sequence(text, self.cleaners)
    if self.add_blank:  # True
        text_norm = intersperse(text_norm, 0)
    text_norm = torch.IntTensor(text_norm)
    return text_norm, cleaned_text
```
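For reference, `intersperse` in Glow-TTS/VITS-style codebases is typically implemented as below; treat this as a sketch of that convention rather than a verbatim quote of this repository's helper:

```python
def intersperse(lst, item):
    # Place `item` between every element of `lst` and at both ends,
    # so N tokens become 2N + 1 tokens.
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst  # the original tokens fill the odd positions
    return result

# Three phone IDs gain a blank (ID 0) before, between, and after each phone:
print(intersperse([5, 9, 13], 0))  # -> [0, 5, 0, 9, 0, 13, 0]
```

So with `add_blank=True`, every phone in the input sequence is flanked by blank tokens, roughly doubling the number of alignable symbols.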

Thanks.

shivammehta25 commented 3 months ago

Hello, that is a great question!

TL;DR: The idea comes from using multiple states per phone in Hidden Markov Model (HMM) based speech synthesisers for better modelling; our previous works Neural-HMM and OverFlow also used it. Since Monotonic Alignment Search (MAS, introduced in Glow-TTS) is a Viterbi approximation to the forward algorithm, the idea has its roots in the same literature: multiple states can be used to model the transitions between different sounds.

More details: In the days of Statistical Parametric Speech Synthesis (SPSS) (you can read more about it here, in section 2.2 right below equation 2.28), people used multiple states to model each phoneme. They found it beneficial to model certain dynamic features with more states, which was especially useful for sounds such as plosives (in English: p, t, k, b, d, g), where you have silence, a sudden burst of energy, and then silence again. Since each state has its own emission parameters, such sounds were hard to model for a left-to-right algorithm with no skips (like MAS) without multiple states representing them.

Modern neural network-based speech synthesisers are much more powerful approximators, so the idea behind adding an extra state is to give MAS a placeholder for learning such dynamic variation and the transitions between sounds. Two states per phone seem to be a nice compromise: the model can learn these dynamic variations when needed, or pass through the blank almost immediately when it doesn't (some transitions don't need a gap between them), and it keeps fewer tensors on the GPU than the three states used in HMM-based synthesisers.
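To make the "placeholder state" idea concrete, here is a toy Viterbi-style monotonic alignment in the spirit of MAS (an illustrative sketch, not the Glow-TTS/Matcha-TTS implementation; the token IDs and likelihoods are made up). With blanks interspersed, the search can park transition frames on a blank when that fits the acoustics, or pass through it with a single frame when it doesn't:

```python
import numpy as np

def toy_mas(log_p):
    """Find the best monotonic, no-skip alignment of mel frames to text
    tokens, given log_p[t, f] = log-likelihood of frame f under token t.
    Every token receives at least one frame; frames move left to right."""
    T, F = log_p.shape
    Q = np.full((T, F), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for f in range(1, F):
        for t in range(min(T, f + 1)):
            stay = Q[t, f - 1]                             # keep same token
            move = Q[t - 1, f - 1] if t > 0 else -np.inf   # advance one token
            Q[t, f] = log_p[t, f] + max(stay, move)
    # Backtrack from the last token at the last frame.
    align, t = np.zeros(F, dtype=int), T - 1
    for f in range(F - 1, -1, -1):
        align[f] = t
        if f > 0 and t > 0 and Q[t - 1, f - 1] >= Q[t, f - 1]:
            t -= 1
    return align

# Tokens after interspersing: [blank, A, blank, B, blank]; 8 mel frames.
# Each frame strongly "fits" one token (0 = match, -10 = mismatch).
fits = {0: [0], 1: [1, 2, 3], 2: [4], 3: [5, 6], 4: [7]}
log_p = np.full((5, 8), -10.0)
for tok, frames in fits.items():
    log_p[tok, frames] = 0.0

print(toy_mas(log_p))  # -> [0 1 1 1 2 3 3 4]
```

In this toy case the middle blank absorbs the single transition frame between A and B; since the alignment is no-skip, a blank always takes at least one frame, but it can stay that short when no transition needs modelling, or stretch over many frames when one does.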

Hope this helps :)