Sampling with a higher std than the one used during training seems like a very interesting way to produce more expressive speech, but the impact on quality is quite significant. From what I understand, the core reason is the low density of training samples in regions far from the center of the distribution. I wanted to run some experiments with a Gaussian Mixture Model as the base distribution. For emotions, for example, each peak would represent the mean speaking manner of a certain emotion. In this setup, the loss function would be computed from the probability that the model assigns to samples of a given emotion under the correct peak. This would allow sampling high-quality emotional speech, and possibly even sampling from the regions between peaks, which would be equivalent to controlling the strength of the emotion in a sample. What is your opinion on that?
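To make the idea a bit more concrete, here is a rough sketch of the loss and the "in-between" sampling I have in mind. It is only an illustration under my own assumptions: the latent `z` and the log-determinant `log_det` are assumed to come from the flow, every emotion gets one isotropic Gaussian component with a shared `std`, and all function names are hypothetical rather than taken from any existing codebase.

```python
import numpy as np

def gaussian_log_prob(z, mean, std):
    """Log-density of an isotropic Gaussian N(mean, std^2 I) evaluated at z."""
    d = z.shape[-1]
    sq_dist = np.sum((z - mean) ** 2, axis=-1)
    return -0.5 * sq_dist / std**2 - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi)

def class_conditional_nll(z, labels, means, std, log_det):
    """Mean negative log-likelihood where each latent is scored under the
    component (peak) that belongs to its own emotion label.

    z:       (n, d) latents produced by the flow (assumed input)
    labels:  (n,)   integer emotion ids
    means:   (k, d) one mean per emotion (could be fixed or learned)
    log_det: (n,)   log|det Jacobian| of the flow (assumed input)
    """
    log_pz = gaussian_log_prob(z, means[labels], std)
    return -np.mean(log_pz + log_det)

def sample_between(means, a, b, alpha, std, n, rng):
    """Draw n latents around a point interpolated between peaks a and b.
    alpha=0 gives peak a, alpha=1 gives peak b."""
    center = (1.0 - alpha) * means[a] + alpha * means[b]
    return center + std * rng.standard_normal((n, means.shape[1]))
```

With `a` set to a neutral peak and `b` to an emotion peak, `alpha` would play the role of the emotion strength. Whether the regions between peaks actually decode to good speech is exactly the open question, since they may still be low-density regions during training.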