Closed by dunky11 3 years ago
Hi, we used a speaker encoder instead of a speaker embedding because a single embedding per speaker can't capture the variation in speech within the same speaker. This variation may include the recording environment, the speaker's prosody, etc.
However, we have not performed extensive ablation studies on the benefits of a speaker encoder over a speaker embedding, so we're not sure exactly how much performance we gain from using a speaker encoder.
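To make the distinction concrete, here is a toy sketch in plain Python. It is not the code from this repository: the names `speaker_embedding`, `speaker_encoder`, `embedding_table`, and `W`, the tiny dimensions, and the fixed weights are all hypothetical, and a real encoder would be a trained network rather than one fixed projection. It only illustrates that a lookup embedding returns the same vector for every utterance of a speaker, while an encoder's output depends on the reference utterance it is given.

```python
DIM = 4      # size of the speaker vector (hypothetical, tiny for readability)
N_MELS = 3   # mel channels per frame (hypothetical)

# Option 1: speaker embedding. A lookup table with one fixed vector per speaker ID.
embedding_table = {
    "spk_a": [0.1, -0.3, 0.7, 0.2],
    "spk_b": [-0.5, 0.4, 0.0, 0.9],
}

def speaker_embedding(speaker_id):
    """Return the stored vector; it ignores the audio entirely."""
    return embedding_table[speaker_id]

# Option 2: speaker encoder. A function of the reference audio features.
# Here: one fixed linear projection per frame, then mean pooling over time.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

def speaker_encoder(mel_frames):
    """Project each frame to DIM values, then average over all frames."""
    projected = [
        [sum(w * x for w, x in zip(row, frame)) for row in W]
        for frame in mel_frames
    ]
    n_frames = len(projected)
    return [sum(frame[d] for frame in projected) / n_frames for d in range(DIM)]

# Two made-up utterances from the same speaker, e.g. recorded in different rooms.
utt_clean = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]]
utt_noisy = [[0.5, 0.6, 0.2], [0.4, 0.7, 0.1]]

# The embedding is identical for both utterances of "spk_a" ...
same_embedding = speaker_embedding("spk_a") == speaker_embedding("spk_a")
# ... while the encoder output changes with the reference utterance.
same_encoding = speaker_encoder(utt_clean) == speaker_encoder(utt_noisy)
```

In this sketch `same_embedding` is true and `same_encoding` is false, which is the within-speaker variation described above. Whether that variation actually helps quality is the part that has not been ablated.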
Thank you very much, that cleared it up. In my experiments, using an encoder also worked better than using embeddings.
What was the reason you switched from speaker embeddings (Cotatron) to a speaker encoder (this)? Was it because it worked better, or was it to support any-to-any voice conversion? I'm curious because I am currently trying to deploy my own architecture and can't really decide between the two.