Open Curiosci opened 3 weeks ago
Our method first trains a wav2vec-U system using unpaired speech and text, followed by pseudo-transcript self-training. Note that we simply refer to every (speech, pseudo-transcript) pair as an alignment. We do not align any speech utterance with its ground-truth transcript.
To obtain the pseudo-transcript for each speech utterance and to prepare TTS training data in ESPnet v1 style, see stages 4 and 5 in run_css10_cpy2.slurm.
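To make the terminology concrete, here is a minimal sketch of what an "alignment" amounts to: run the trained unsupervised ASR over each utterance and pair the audio with its decoded output. This is not the repo's actual code; `transcribe` and `build_alignments` are hypothetical names, and `transcribe` is a stand-in for real wav2vec-U decoding.

```python
def transcribe(wav_path):
    # Hypothetical stand-in for wav2vec-U decoding; in reality this would
    # run the trained unsupervised model and return its best hypothesis.
    return "pseudo transcript for " + wav_path

def build_alignments(wav_paths):
    # Every (speech, pseudo-transcript) pair is called an "alignment";
    # no ground-truth transcript is involved anywhere.
    return [(p, transcribe(p)) for p in wav_paths]
```

The resulting pairs can then be written out as TTS training data (stage 5 in the slurm script handles the actual formatting).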
Thanks for your answer. How about pseudo-transcript self-training? How is it done, and why was it necessary in your system? Also, how does it fit into the GAN architecture?
Hi, you can refer to the paper Unsupervised Speech Recognition and the works it cites for more details of the self-training process. Basically, it is a refinement step that further improves the pseudo-transcripts, and it is independent of the GAN architecture.
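For intuition only, the generic pseudo-labeling loop behind self-training can be sketched on a toy problem. This is not the wav2vec-U pipeline (which decodes with an external language model per the paper); a trivial 1-D threshold "classifier" stands in for the acoustic model, and all names here are hypothetical:

```python
def train_threshold(xs, ys):
    # Toy "training": pick the threshold t (predict 1 when x >= t)
    # that maximizes accuracy on the given labels.
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def self_train(labeled_x, labeled_y, unlabeled_x, rounds=2, margin=1.0):
    # Generic self-training loop: train, pseudo-label confident unlabeled
    # points, retrain on the union, repeat.
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    t = train_threshold(x, y)
    for _ in range(rounds):
        # Keep only points far from the decision boundary ("confident")
        confident = [u for u in pool if abs(u - t) >= margin]
        for u in confident:
            x.append(u)
            y.append(u >= t)  # pseudo-label with the current model
        pool = [u for u in pool if u not in confident]
        t = train_threshold(x, y)  # refine on labeled + pseudo-labeled data
    return t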
Hi. I am trying to understand you approach and I still don't quite see how alignments are done for unrelated text and speech corporas. Could you please explain that and point out the files in the code that implement it ?