To synthesize speech with this system, it seems we should only need the audio prompt and the text we want to synthesize.
So why does the inference parser also require a text prompt?
Because training never treated the prompt's text separately; in fact, training did not distinguish between prompt and target at all.
So at inference the text prompt has to cover as much as the audio prompt does, i.e. it should match what is spoken in the audio prompt.
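
As a rough illustration of the point above, here is a minimal sketch of how the inference inputs could be assembled so that they look like a single training utterance. The function name `build_inference_inputs` and the integer token IDs are hypothetical and are not taken from this system's code; the only assumption carried over from the answer is that prompt and target are not distinguished by the model.

```python
from typing import List, Tuple


def build_inference_inputs(
    text_prompt: List[int],
    target_text: List[int],
    audio_prompt: List[int],
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: lay out inference inputs like one training utterance.

    The model was trained on (full text, full audio) pairs with no
    prompt/target boundary, so the text side must also contain the
    text that corresponds to the audio prompt.
    """
    # Text side: text of the audio prompt followed by the text to synthesize.
    text_tokens = text_prompt + target_text
    # Audio side: the audio prompt acts as a prefix; the model continues from it.
    audio_prefix = list(audio_prompt)
    return text_tokens, audio_prefix


# Toy token IDs, made up for illustration only.
text_tokens, audio_prefix = build_inference_inputs(
    text_prompt=[1, 2, 3],
    target_text=[4, 5],
    audio_prompt=[101, 102, 103, 104],
)
print(text_tokens)
print(audio_prefix)
```

If the text prompt were left out, the text side would contain only the target text while the audio side would start with speech that has no matching text, which is a layout the model never saw during training.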