Ming-er opened 2 months ago
Hello, thank you for your question. We follow the common practice in speech-to-speech translation [1] and merge consecutive identical units; prior work has found that this makes model training easier. As for prosody, a duration predictor sits before the HiFi-GAN vocoder: it predicts the duration of each unit and replicates the unit accordingly, so the merging does not hurt prosody modeling.
[1] Direct speech-to-speech translation with discrete units.
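The merge-then-replicate idea above can be sketched as follows. This is only a minimal illustration, not the paper's actual code: the duration predictor is stubbed out by reusing the true run lengths, and the unit values are made up.

```python
from itertools import groupby

def merge_units(units):
    """Collapse runs of identical units into (unit, duration) pairs."""
    return [(u, len(list(g))) for u, g in groupby(units)]

def replicate(pairs):
    """Inverse operation, as a duration predictor + length regulator would do."""
    return [u for u, d in pairs for _ in range(d)]

# Hypothetical HuBERT unit sequence with repeats.
units = [52, 52, 52, 17, 17, 9, 52, 52]
pairs = merge_units(units)       # [(52, 3), (17, 2), (9, 1), (52, 2)]
deduped = [u for u, _ in pairs]  # [52, 17, 9, 52] -- what the decoder predicts
assert replicate(pairs) == units  # predicted durations restore the full sequence
```

The point is that the durations discarded by merging are not lost to the vocoder: the duration predictor re-estimates them per unit, so prosodic timing can still be modeled.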
Hi, this is really interesting work, but I have a question about the modeling of prosody. In Section 2.4 (Speech Decoder), I note the operation "consecutive identical indices are merged into a single unit". I wonder whether this affects the prosody of the generated speech, since the durations (repeat counts) of the semantic tokens (HuBERT units) may carry prosodic information. Besides, I don't think consecutive identical tokens in Y^U would affect the computation of the CTC loss, so why remove them?
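For context on the CTC point: CTC's standard decoding rule first collapses consecutive repeats and then removes blanks, so a deduplicated unit sequence lives in CTC's natural output space. A minimal greedy-decoding sketch (illustrative only; the frame labels and blank index are made up):

```python
def ctc_collapse(frames, blank=0):
    """Greedy CTC post-processing: merge consecutive repeats, drop blanks."""
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:
            out.append(f)
        prev = f  # track previous frame, including blanks and repeats
    return out

# A repeated label survives only if a blank separates the two runs.
frames = [5, 5, 0, 5, 7, 7, 0, 0, 9]
print(ctc_collapse(frames))  # [5, 5, 7, 9]
```

Note that if a target sequence kept genuine repeats (e.g. unit 5 twice in a row), CTC would require a blank between them in every valid alignment, so deduplicating the targets does change the set of alignments the loss sums over.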