Open YuXiangLo opened 2 months ago
I want to ask this too. I haven't tested it yet, but I wonder how the results would differ if I changed the end part to be mask0 EOS mask0 empty.
Oh, I think I understand now, based on Jason's answer to a different question: for zero-shot TTS, it looks like a different model is trained, without the causal mask. You can see that there are two different sets of weights, one for edits and one for TTS!
As the title says: if we do not mask the audio, namely y, how can the model know that TTS is going to be performed?