gpt-omni / mini-omni

Mini-Omni is an open-source multimodal large language model that can hear and talk while it thinks, featuring real-time end-to-end speech input and streaming audio output for conversation.
https://arxiv.org/abs/2408.16725
MIT License

Delay pattern decoding #80

Open · wangers opened this issue 2 months ago

wangers commented 2 months ago

In the paper, the delay pattern enables streaming decoding. However, I could not find a method like build_delay_input_ids in the code, and the model seems to have no TTS adapter. So I am not sure whether the following is correct: is the model currently doing plain parallel decoding rather than delayed parallel decoding?
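For reference, a minimal illustrative sketch of the distinction being asked about: in a delayed parallel layout, codebook layer k is shifted right by k steps and padded, so early steps only carry the lower layers. The helper below is hypothetical (neither build_delayed_codes nor pad_id comes from the mini-omni codebase):

```python
import torch

def build_delayed_codes(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Purely illustrative delay-pattern layout, not mini-omni's actual code.

    codes:  [num_layers, T] parallel codec tokens, one row per codebook layer.
    Returns [num_layers, T + num_layers - 1], where layer k is shifted right
    by k steps and the gaps are filled with pad_id.
    """
    num_layers, T = codes.shape
    out = torch.full((num_layers, T + num_layers - 1), pad_id, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + T] = codes[k]
    return out

# Plain parallel decoding emits a real token for every layer at every step;
# with the delayed layout, step 0 only carries layer 0's first token, step 1
# adds layer 1, and so on. This staircase is what the per-step outputs should
# show if the model really decodes with a delay pattern.
```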

mini-omni commented 1 month ago

hi,

  1. please refer to https://github.com/gpt-omni/mini-omni/tree/main?tab=readme-ov-file#faq; the open-source version does not support the TTS adapter.
  2. the model does use the delayed pattern; you can print list_output at each inference step to confirm.
wangers commented 1 month ago

> hi,
>
> 1. please refer to https://github.com/gpt-omni/mini-omni/tree/main?tab=readme-ov-file#faq; the open-source version does not support the TTS adapter.
> 2. the model does use the delayed pattern; you can print list_output at each inference step to confirm.

Thanks for the answers. A few more points are still unclear to me:

  1. Decoding-related. The delay-pattern reconstruction of the flattened output looks like this:

     ```python
     if n_tensors == 7:
         for i in range(0, len(flattened_output), 8):
             tensor1.append(flattened_output[i + 1])
             tensor2.append(flattened_output[i + 2])
             tensor3.append(flattened_output[i + 3])
             tensor3.append(flattened_output[i + 4])

             tensor2.append(flattened_output[i + 5])
             tensor3.append(flattened_output[i + 6])
             tensor3.append(flattened_output[i + 7])
             codes = [
                 list_to_torch_tensor(tensor1).to(device),
                 list_to_torch_tensor(tensor2).to(device),
                 list_to_torch_tensor(tensor3).to(device),
             ]
     ```
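For context, a minimal sketch of how the three reconstructed code tensors would then be handed to the SNAC decoder, assuming the snac package's published interface and the 24 kHz checkpoint (dummy random codes stand in for the real `codes` built above):

```python
import torch
from snac import SNAC

# Hypothetical usage sketch, not mini-omni's actual inference code.
# The three tensors follow SNAC's 1:2:4 layer ratio, i.e. shapes
# [1, T], [1, 2T], [1, 4T].
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

T = 10  # dummy number of coarse frames, for illustration only
codes = [
    torch.randint(0, 4096, (1, T)),      # layer 1 (coarsest)
    torch.randint(0, 4096, (1, 2 * T)),  # layer 2
    torch.randint(0, 4096, (1, 4 * T)),  # layer 3 (finest)
]
with torch.inference_mode():
    audio = snac_model.decode(codes)  # waveform tensor, roughly [1, 1, samples]
```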

My understanding is that with the delay pattern, during parallel generation of the 7 tokens the higher-layer tokens condition on the previously generated tokens, which matches RVQ modeling. A characteristic of the SNAC codes is that the temporal resolution increases from the encoder output to the decoder input. When I map the code above onto the SNAC codebooks, it turns out to be a preorder (depth-first) traversal. By my understanding (which may be wrong), the second-layer token at position 5 then conditions on the third-layer tokens at positions 3 and 4, so would a level-order (breadth-first) traversal also be reasonable? Also, is splitting into 7 layers instead of the corresponding 3 layers meant to reduce the number of autoregressive steps and lower latency? (And why choose SNAC coding: because of its high compression rate, or is SNAC also your own work?)
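To make the traversal question concrete, here is a small illustrative sketch of my own (not from the repo): it lays out one SNAC frame's 1 + 2 + 4 tokens in the preorder implied by the reconstruction code above, next to the level-order alternative raised in the question.

```python
# One SNAC frame: layer 1 contributes 1 token, layer 2 contributes 2,
# layer 3 contributes 4 (the 1:2:4 resolution ratio). Labels are per-frame
# positions, purely for illustration.
l1 = ["L1_0"]
l2 = ["L2_0", "L2_1"]
l3 = ["L3_0", "L3_1", "L3_2", "L3_3"]

# Preorder (depth-first) flattening: the order the reconstruction code implies.
preorder = [l1[0], l2[0], l3[0], l3[1], l2[1], l3[2], l3[3]]

# Level-order (breadth-first) flattening: the alternative in the question.
level_order = [l1[0], l2[0], l2[1], l3[0], l3[1], l3[2], l3[3]]

print(preorder)     # ['L1_0', 'L2_0', 'L3_0', 'L3_1', 'L2_1', 'L3_2', 'L3_3']
print(level_order)  # ['L1_0', 'L2_0', 'L2_1', 'L3_0', 'L3_1', 'L3_2', 'L3_3']
```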

Also, the paper states: “Prior to generating audio tokens, padding with N tokens ensures that the corresponding text tokens are produced first.” Does N here refer to 1, or to the average sequence length?

  2. Training-related: Stage 1 is meant to acquire ASR/TTS capability, and the vocabulary is expanded with audio tokens; however, in this stage the LLM is frozen and only the adapters between the input and the LLM output are trained. When training on audio tokens with autoregressive teacher forcing, unlike with continuous features, the discrete tokens' acoustic representation space does not appear to be aligned with the LLM's text space, which confuses me. Also, the dataset section of the paper writes ASR/TTS as A1|T1. Does this mean that both ASR and TTS samples are trained autoregressively on audio + text tokens, i.e. the sequence mixes text and audio tokens, or is it ASR -> text tokens and TTS -> audio tokens separately?
  3. In Stage 1, the datasets used are relatively clean, e.g. VCTK and LibriTTS for TTS, plus the audiobook corpus LibriSpeech. Is data from more complex scenarios (e.g. GigaSpeech) left out because of concerns about its effect on the audio tokens?
  4. Does the timbre in the open-source version come from instruction-tuned GPT-4o? If I want to customize the timbre, what would you suggest within this framework?

Looking forward to your clarification.
