Why doesn't TTS inference need masking?

jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Why doesn't TTS inference need masking? #146

Open YuXiangLo opened 2 months ago

YuXiangLo commented 2 months ago

As the title says: if we do not mask the audio (namely y), how does the model know that TTS is to be performed?

zmy1116 commented 1 month ago

I want to ask this too. I haven't tested it yet, but I wonder how the results would differ if I changed the end part to be mask0 EOS mask0 empty.

zmy1116 commented 1 month ago

Oh, I think I understand now, based on Jason's answer to a different question: for zero-shot TTS, it looks like a DIFFERENT model is trained, without the causal mask. You can see that there are two different sets of weights, one for editing and one for TTS!