F0-CONSISTENT MANY-TO-MANY VOICE CONVERSION VIA CONDITIONAL AUTOENCODER

Link

https://arxiv.org/abs/2004.07370

What is it?

Adds an F0 conditioning input to the decoder of the conditional autoencoder.

What makes it better than prior work?

Prevents F0 flipping in cross-gender VC. Strong F0 consistency.

What is the key to the technique and method?

Extract log-F0, normalize it to the range 0 to 1, quantize that range into 256 bins, and feed the result to the decoder as a one-hot vector.
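The quantization step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the min-max normalization, and the 50 to 600 Hz range are all assumptions made for the example, and it expects voiced frames only (F0 > 0).

```python
import numpy as np


def normalize_log_f0(f0_hz, lo_hz=50.0, hi_hz=600.0):
    """Map voiced F0 values (Hz) to [0, 1] on a log scale.

    Assumption: min-max normalization over a hypothetical 50-600 Hz range.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    span = np.log(hi_hz) - np.log(lo_hz)
    return np.clip((np.log(f0) - np.log(lo_hz)) / span, 0.0, 1.0)


def quantize_f0(norm_log_f0, n_bins=256):
    """Quantize normalized log-F0 in [0, 1] into n_bins and one-hot encode."""
    x = np.clip(np.asarray(norm_log_f0, dtype=np.float64), 0.0, 1.0)
    # Scale to bin indices; the value 1.0 is folded into the last bin.
    idx = np.minimum((x * n_bins).astype(np.int64), n_bins - 1)
    one_hot = np.zeros((x.shape[0], n_bins), dtype=np.float32)
    one_hot[np.arange(x.shape[0]), idx] = 1.0
    return idx, one_hot


if __name__ == "__main__":
    f0_track = [110.0, 120.0, 220.0, 240.0]  # made-up voiced frames in Hz
    idx, one_hot = quantize_f0(normalize_log_f0(f0_track))
    print(idx, one_hot.shape)
```

Each frame yields one 256-dimensional one-hot vector, which would then be passed to the decoder alongside its other inputs.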

How was it validated?

After cross-gender VC, the authors plotted F0 distributions and found that the distribution of the converted voice overlaps that of the target speaker, with no peak centered at the F0 of the other gender. In a MOS test, the method scored 3.732 for quality and 3.331 for similarity, versus 3.546 and 3.076, respectively, for a basic AutoVC.
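One way to put a number on "the distributions overlap" is a histogram intersection between two sets of log-F0 values. The sketch below is only an illustration: the summary describes visual inspection of plots, so the overlap score, the function name `histogram_overlap`, and the bin count are assumptions, not something taken from the paper.

```python
import numpy as np


def histogram_overlap(log_f0_a, log_f0_b, n_bins=50):
    """Histogram intersection of two log-F0 samples (0 = disjoint, 1 = identical)."""
    a = np.asarray(log_f0_a, dtype=np.float64)
    b = np.asarray(log_f0_b, dtype=np.float64)
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:
        # All values are identical, so the two distributions coincide.
        return 1.0
    # Shared bin edges so that both histograms are directly comparable.
    edges = np.linspace(lo, hi, n_bins + 1)
    pa, _ = np.histogram(a, bins=edges)
    pb, _ = np.histogram(b, bins=edges)
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return float(np.minimum(pa, pb).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic log-F0 samples, purely illustrative (not data from the paper).
    target = rng.normal(np.log(210.0), 0.15, size=2000)
    converted = rng.normal(np.log(205.0), 0.15, size=2000)
    source = rng.normal(np.log(115.0), 0.15, size=2000)
    print("converted vs target:", histogram_overlap(converted, target))
    print("converted vs source:", histogram_overlap(converted, source))
```

A high score against the target speaker together with a low score against the source speaker would agree with the qualitative finding that no peak remains at the other gender's F0.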
