Reason to use speaker encoder over speaker embeddings?

maum-ai / assem-vc

Official Code for Assem-VC @ICASSP2022

https://mindslab-ai.github.io/assem-vc/

BSD 3-Clause "New" or "Revised" License


Reason to use speaker encoder over speaker embeddings? #20

Closed: dunky11 closed this issue 3 years ago

dunky11 commented 3 years ago

What was the reason you switched from speaker embeddings (Cotatron) to a speaker encoder (this repo)? Was it because it worked better, or was it to support any-to-any voice conversion? I'm curious because I'm currently trying to deploy my own architecture and can't really decide between the two.

wookladin commented 3 years ago

Hi, we chose a speaker encoder over a speaker embedding because a speaker embedding is a single fixed vector per speaker, so it can't capture the variation in speech within the same speaker. These variations may include the recording environment, the speaker's prosody, etc.

However, we have not performed extensive ablation studies on the benefits of a speaker encoder over a speaker embedding, so we're not sure exactly how much performance we gain from using a speaker encoder.
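To make the distinction concrete, here is a minimal toy sketch in plain Python. It is not the Assem-VC or Cotatron code: `SpeakerEmbeddingTable`, `ToySpeakerEncoder`, and `fake_mel` are hypothetical names, the "mel-spectrograms" are random numbers, and the encoder is just mean pooling followed by a random linear projection. It only illustrates that a lookup table returns the same vector for every utterance of a speaker, while an encoder computes its vector from the utterance itself.

```python
import random


class SpeakerEmbeddingTable:
    """Lookup table: one fixed vector per speaker ID (hypothetical toy class)."""

    def __init__(self, num_speakers, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                      for _ in range(num_speakers)]

    def __call__(self, speaker_id, mel=None):
        # The utterance is ignored: the output depends on the speaker ID only.
        return self.table[speaker_id]


class ToySpeakerEncoder:
    """Toy encoder: mean-pool mel frames over time, then project linearly."""

    def __init__(self, n_mels, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 1.0) for _ in range(n_mels)]
                       for _ in range(dim)]

    def __call__(self, mel):
        # mel is a list of frames; each frame is a list of n_mels floats.
        n_frames = len(mel)
        pooled = [sum(frame[i] for frame in mel) / n_frames
                  for i in range(len(mel[0]))]
        return [sum(w * x for w, x in zip(row, pooled)) for row in self.weight]


def fake_mel(n_frames, n_mels, seed):
    """Random stand-in for a mel-spectrogram (assumption: no real audio here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_mels)]
            for _ in range(n_frames)]


# Two different utterances, pretending both come from speaker 2.
utt_a = fake_mel(n_frames=50, n_mels=8, seed=1)
utt_b = fake_mel(n_frames=70, n_mels=8, seed=2)

table = SpeakerEmbeddingTable(num_speakers=4, dim=3)
encoder = ToySpeakerEncoder(n_mels=8, dim=3)

# The table cannot tell the utterances apart; the encoder can.
same_from_table = table(2, utt_a) == table(2, utt_b)
same_from_encoder = encoder(utt_a) == encoder(utt_b)
print(same_from_table, same_from_encoder)
```

In this sketch the table gives identical vectors for both utterances, whereas the encoder's output changes with the input, which is where per-utterance factors such as recording environment or prosody could be reflected. The encoder also needs no speaker ID at all, only audio features.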

dunky11 commented 3 years ago

Thank you very much, that cleared it up. In my experiments, using an encoder worked better than using embeddings too.