ictnlp / LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
https://arxiv.org/abs/2409.06666
Apache License 2.0

Modelling of the prosody #13

Open Ming-er opened 2 months ago

Ming-er commented 2 months ago

Hi, this is really interesting work, but I have a question about the modelling of prosody. In the "2.4 Speech Decoder" section, I note the operation "consecutive identical indices are merged into a single unit". I wonder whether this affects the prosody of the generated speech, since the durations (numbers of repetitions) of the semantic tokens (HuBERT units) may carry some prosodic information. Besides, I think consecutive identical tokens in Y^U would not affect the calculation of the CTC loss, so why do you remove them?
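For concreteness, here is a minimal sketch of the merge I am referring to (the helper name is my own, not from the repo's code), and of the run-length (rough duration) information that the merge discards:

```python
from itertools import groupby

def merge_consecutive_units(units):
    """Collapse runs of identical HuBERT unit indices, keeping run lengths.

    Returns the deduplicated unit sequence and the per-unit durations
    (number of consecutive repeats) that are discarded by the merge.
    """
    merged, durations = [], []
    for unit, run in groupby(units):
        merged.append(unit)
        durations.append(len(list(run)))
    return merged, durations

# Example: the repeat counts are lost after merging.
units = [34, 34, 34, 91, 91, 7, 7, 7, 7]
merged, durations = merge_consecutive_units(units)
print(merged)     # [34, 91, 7]
print(durations)  # [3, 2, 4]  <- timing information that may carry prosody
```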

Poeroz commented 1 month ago

Hello, thank you for your question. We followed the common setup in speech-to-speech translation [1] and merged consecutive identical units; previous work has found that this makes model training easier. As for prosody, there is a duration predictor before the HiFi-GAN vocoder that predicts the duration of each unit and replicates it accordingly, so merging the repeats does not affect prosody modeling.
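A rough sketch of that upsampling step (the function name is illustrative, assuming a FastSpeech-style duration predictor feeding the unit-based HiFi-GAN of [1]): each deduplicated unit is repeated by its predicted duration before the vocoder generates the waveform.

```python
import torch

def expand_by_duration(unit_embeddings, durations):
    """Replicate each unit embedding according to its predicted duration.

    unit_embeddings: (T, D) tensor of deduplicated unit embeddings
    durations:       (T,)  integer tensor of predicted frame counts per unit
    Returns an upsampled (sum(durations), D) tensor passed on to the vocoder.
    """
    return torch.repeat_interleave(unit_embeddings, durations, dim=0)

# Toy example: 3 deduplicated units, predicted to last 3, 2, and 4 frames.
embeds = torch.randn(3, 8)
durations = torch.tensor([3, 2, 4])
frames = expand_by_duration(embeds, durations)
print(frames.shape)  # torch.Size([9, 8])
```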

[1] Direct speech-to-speech translation with discrete units.