lwang114 / UnsupTTS

MIT License

Speech_Audio Alignment #3

Open Curiosci opened 3 weeks ago

Curiosci commented 3 weeks ago

Hi. I am trying to understand your approach, and I still don't quite see how alignments are done for unrelated text and speech corpora. Could you please explain that and point out the files in the code that implement it?

JeromeNi commented 3 weeks ago

Our method first trains a wav2vec-U system using unpaired speech and text, followed by pseudo-transcript self-training. Note that we simply call each (speech, pseudo-transcript) pair an alignment; we do not align speech utterances with their ground-truth transcripts.
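To make the terminology concrete, here is a minimal sketch of what such an "alignment" amounts to. The `decode` callable is a hypothetical stand-in for the trained wav2vec-U decoder; no ground-truth transcripts are involved.

```python
def make_alignments(utterances, decode):
    """Pair each speech utterance with its decoded pseudo-transcript.

    utterances: iterable of (utt_id, wav_path) pairs
    decode: callable mapping a wav path to a pseudo-transcript string
            (stand-in for the unsupervised ASR decoder)
    Returns a list of (utt_id, wav_path, pseudo_transcript) triples --
    the "alignments" used downstream.
    """
    return [(utt_id, wav, decode(wav)) for utt_id, wav in utterances]


# Usage with a dummy decoder in place of the real model:
utts = [("utt1", "utt1.wav"), ("utt2", "utt2.wav")]
pairs = make_alignments(utts, decode=lambda wav: f"pseudo text for {wav}")
```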

To obtain the pseudo-transcript for each speech utterance and to prepare TTS training data in ESPnet v1 style, see stages 4 and 5 in run_css10_cpy2.slurm.

Curiosci commented 1 day ago

Thanks for your answer. How is pseudo-transcript self-training done, why was it necessary in your system, and how does it fit into the GAN architecture?

cactuswiththoughts commented 18 hours ago

Hi, you can refer to the paper Unsupervised Speech Recognition and the works it cites for more details on the self-training process. It is essentially a refinement step that further improves the pseudo-transcripts, and it is independent of the GAN architecture.
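The refinement idea can be sketched abstractly: the initial pseudo-transcripts serve as labels to train a supervised model, whose own decodings become the next round of pseudo-transcripts. `train_supervised` and `decode_all` below are hypothetical stand-ins, not functions from this repository.

```python
def self_train(wavs, initial_labels, train_supervised, decode_all, rounds=2):
    """Iterative self-training sketch.

    wavs: list of speech utterances (here, opaque handles)
    initial_labels: pseudo-transcripts from the unsupervised model
    train_supervised: (wavs, labels) -> model, fits on current pseudo-labels
    decode_all: (model, wavs) -> new labels, re-decodes the speech
    """
    labels = initial_labels
    for _ in range(rounds):
        model = train_supervised(wavs, labels)  # fit on current pseudo-labels
        labels = decode_all(model, wavs)        # refined pseudo-transcripts
    return labels
```

Each round the supervised model smooths over errors in its training labels, so the decoded pseudo-transcripts tend to improve; the GAN is only used to produce the initial labels.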