Ming-er opened 2 months ago
Hello, thank you for your question. We follow the common practice in speech-to-speech translation [1] and merge consecutive identical units; prior work has found that this makes model training easier. As for prosody, a duration predictor sits before the HiFi-GAN vocoder: it predicts the duration of each unit and replicates the unit accordingly, so the merging does not hurt prosody modeling.
[1] Direct speech-to-speech translation with discrete units.
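The merge-then-replicate idea above can be sketched as follows. This is only a minimal illustration, not the paper's actual code: the duration predictor is stubbed out by reusing the true run lengths, and the unit values are made up.

```python
from itertools import groupby

def merge_units(units):
    """Collapse runs of identical units into (unit, duration) pairs."""
    return [(u, len(list(g))) for u, g in groupby(units)]

def replicate(pairs):
    """Inverse operation, as a duration predictor + length regulator would do."""
    return [u for u, d in pairs for _ in range(d)]

# Hypothetical HuBERT unit sequence with repeats.
units = [52, 52, 52, 17, 17, 9, 52, 52]
pairs = merge_units(units)       # [(52, 3), (17, 2), (9, 1), (52, 2)]
deduped = [u for u, _ in pairs]  # [52, 17, 9, 52] -- what the decoder predicts
assert replicate(pairs) == units  # predicted durations restore the full sequence
```

The point is that the durations discarded by merging are not lost to the vocoder: the duration predictor re-estimates them per unit, so prosodic timing can still be modeled.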
Hi, this is really interesting work, but I have a question about the modeling of prosody. In Section 2.4 (Speech Decoder), I note the operation "consecutive identical indices are merged into a single unit". I wonder whether this affects the prosody of the generated speech, since the durations (repeat counts) of the semantic tokens (HuBERT units) may carry prosodic information. Besides, I don't think consecutive identical tokens in Y^U would affect the computation of the CTC loss, so why remove them?
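For context on the CTC point: CTC's standard decoding rule first collapses consecutive repeats and then removes blanks, so a deduplicated unit sequence lives in CTC's natural output space. A minimal greedy-decoding sketch (illustrative only; the frame labels and blank index are made up):

```python
def ctc_collapse(frames, blank=0):
    """Greedy CTC post-processing: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:
            out.append(f)
        prev = f  # track previous frame, including blanks and repeats
    return out

# A repeated label survives only if a blank separates the two runs.
frames = [5, 5, 0, 5, 7, 7, 0, 0, 9]
print(ctc_collapse(frames))  # [5, 5, 7, 9]
```

Note that if a target sequence kept genuine repeats (e.g. unit 5 twice in a row), CTC would require a blank between them in every valid alignment, so deduplicating the targets does change the set of alignments the loss sums over.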