Link
https://ieeexplore.ieee.org/document/9054681
What is it?
Using knowledge distillation (KD) for robust Tacotron2-based TTS
How does it improve on prior work?
Previous research: A standard Tacotron2 decoder generates frames in scheduled-sampling mode during training, but runs in free-running mode at inference time. Because of this mismatch between natural frames seen in training and predicted frames consumed at inference, the model is fragile when generating outside the training-data distribution.
Proposed method: This work uses a teacher model and a student model. The teacher decoder is trained in teacher-forcing mode, where the decoder input at each step is the one-step-shifted natural frame; its loss is the L2 norm between the natural and generated frames. The student model reuses the teacher's encoder and trains its decoder in free-running mode, where the decoder input is the decoder's own output from the previous step. The student loss is the L2 norm between natural and generated frames plus the squared error between the trained teacher decoder's output and the student decoder's output. Through this distillation term, the student decoder becomes robust when generating outside the range of the training data.
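The training setup above can be sketched numerically. This is a minimal toy illustration, not the paper's actual architecture: the "decoders" here are single linear maps with a bias, and all names, shapes, and weights are illustrative assumptions. It only shows the data flow (teacher forcing vs. free running) and the two loss terms.

```python
# Toy sketch of the KD setup (assumed/simplified, not the paper's model):
# a linear autoregressive "decoder" y_t = x_t @ W + b standing in for the
# Tacotron2 decoder. Frames are random vectors standing in for mel frames.
import numpy as np

rng = np.random.default_rng(0)
T, D = 5, 3                          # number of frames, feature dimension
natural = rng.normal(size=(T, D))    # ground-truth frames (toy data)

def make_decoder():
    # Illustrative weights; in the paper these would be trained networks.
    return rng.normal(size=(D, D)) * 0.1, rng.normal(size=D) * 0.1

(W_t, b_t), (W_s, b_s) = make_decoder(), make_decoder()

def step(x, W, b):
    return x @ W + b

def decode_teacher_forcing(W, b, targets):
    # Teacher mode: each step consumes the one-step-shifted NATURAL frame.
    prev, out = np.zeros(D), []
    for t in range(len(targets)):
        out.append(step(prev, W, b))
        prev = targets[t]            # feed ground truth back in
    return np.stack(out)

def decode_free_running(W, b, n_steps):
    # Student mode: each step consumes the decoder's own previous output.
    prev, out = np.zeros(D), []
    for _ in range(n_steps):
        prev = step(prev, W, b)      # feed prediction back in
        out.append(prev)
    return np.stack(out)

teacher_out = decode_teacher_forcing(W_t, b_t, natural)
student_out = decode_free_running(W_s, b_s, T)

# Teacher loss: L2 between natural frames and teacher-forced output.
teacher_loss = np.mean((teacher_out - natural) ** 2)
# Student loss: L2 to natural frames PLUS squared error against the
# trained teacher decoder's output (the distillation term).
student_loss = (np.mean((student_out - natural) ** 2)
                + np.mean((student_out - teacher_out) ** 2))
print(float(teacher_loss), float(student_loss))
```

The distillation term is what distinguishes the student objective: even in free-running mode, the student is pulled toward the teacher's (teacher-forced) trajectory rather than drifting on its own predictions.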
What is the key technique?
Described above.
How was it validated?
Compared against Tacotron2 with scheduled sampling (Tacotron SS) on MOS and WER.

English:
     MOS    WER
SS   3.21   23.82%
KD   3.93   2.17%

Chinese:
     MOS    WER
SS   3.18   9.44%
KD   3.94   0.67%
Any discussion?
None.
Papers to read next