Is the forced aligner tight enough to the phonemes?

Potential Issue: The forced aligner's timings may not be accurate enough for the TTS audio. If the phoneme boundaries are off, the syllables cut from them will be quantized to the wrong positions in the rhythm.
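To make the risk concrete, here is a minimal sketch of how a small onset error can push a syllable into the wrong grid slot. The tempo, the sixteenth-note grid, and the function name `quantize_onset` are assumptions for illustration and are not taken from the project code.

```python
def quantize_onset(onset_s, bpm, subdivisions_per_beat=4):
    """Snap an onset time in seconds to the index of the nearest grid slot.

    Assumes a fixed tempo and a uniform grid that starts at t = 0.
    """
    slot_s = 60.0 / bpm / subdivisions_per_beat  # duration of one grid slot
    return round(onset_s / slot_s)


# At 120 BPM with sixteenth notes, each slot lasts 0.125 s.
true_slot = quantize_onset(0.50, bpm=120)     # perceived syllable onset
aligner_slot = quantize_onset(0.57, bpm=120)  # same onset reported 70 ms late
print(true_slot, aligner_slot)
```

In this setup a 70 ms alignment error is enough to move the syllable by one slot, so the acceptable error depends on the tempo and grid resolution we target.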

Solution: The Montreal Forced Aligner (MFA) is trainable, so we may be able to improve its alignment accuracy by training it on audio from the TTS.

Desired Action: Check whether the MFA timings are tight enough to the phonemes for syllable isolation. In each isolated segment, is there a noticeable amount of time before or after the point where you perceive the syllable? Can we get an estimate of the error?
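One possible way to estimate the error is to hand-annotate a small sample of syllables and compare those boundaries with the aligner's. The sketch below assumes both sets are already available as matching lists of `(start, end)` times in seconds; the function names and the 20 ms tolerance are placeholders, not part of MFA or of this project.

```python
from statistics import mean, median


def boundary_errors(predicted, reference):
    """Return signed onset and offset errors in seconds.

    Both arguments are lists of (start, end) tuples in the same order.
    A positive error means the aligner's boundary is later than the reference.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    onset = [p[0] - r[0] for p, r in zip(predicted, reference)]
    offset = [p[1] - r[1] for p, r in zip(predicted, reference)]
    return onset, offset


def summarize(errors, tolerance_s=0.02):
    """Summarize a list of signed errors; tolerance_s is an assumed threshold."""
    abs_err = [abs(e) for e in errors]
    return {
        "mean_signed": mean(errors),    # systematic early/late bias
        "mean_abs": mean(abs_err),
        "median_abs": median(abs_err),
        "max_abs": max(abs_err),
        "within_tolerance": sum(a <= tolerance_s for a in abs_err) / len(abs_err),
    }


# Made-up example values, only to show the call pattern.
aligner = [(0.00, 0.21), (0.21, 0.48), (0.48, 0.80)]
manual = [(0.02, 0.20), (0.20, 0.45), (0.50, 0.80)]
onset_err, offset_err = boundary_errors(aligner, manual)
print(summarize(onset_err))
print(summarize(offset_err))
```

A consistent sign in `mean_signed` would suggest a systematic bias that a fixed offset could correct, while a large `max_abs` with a small median would point to occasional outliers instead.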

Notes: This is a critical piece of the project and a high-priority item.

jerivl / Deepcut

Is the forced aligner tight enough to the phonemes? #12