Open wangers opened 2 months ago
hi,
- please refer to: https://github.com/gpt-omni/mini-omni/tree/main?tab=readme-ov-file#faq; the open-source version does not support the tts-adapter
- the model uses the delay pattern; you can print `list_output` at each inference step to confirm (toy sketch below).
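As a toy illustration of what to look for (this is not the exact structure of `list_output`, just the general shape of a delay pattern): layer k is shifted right by k steps, so during the first steps the higher layers should still show the audio padding token.

```python
# Standalone toy: what a per-step dump looks like under a delay pattern.
# PAD stands in for the audio padding token; real token ids will differ.
PAD = -1
num_layers, steps = 7, 10
layers = [[PAD] * k + list(range(100 * (k + 1), 100 * (k + 1) + steps - k))
          for k in range(num_layers)]
for t in range(steps):
    # with a delay pattern, layers with k > t still print PAD at step t
    print(f"step {t}:", [layers[k][t] for k in range(num_layers)])
```

If the printed `list_output` shows this staircase of padding tokens, the model is doing delayed (not plain) parallel decoding.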
Thanks for the answers. A few more points are still unclear to me:
Regarding decoding:

```python
if n_tensors == 7:
    for i in range(0, len(flattened_output), 8):
        tensor1.append(flattened_output[i + 1])
        tensor2.append(flattened_output[i + 2])
        tensor3.append(flattened_output[i + 3])
        tensor3.append(flattened_output[i + 4])
        tensor2.append(flattened_output[i + 5])
        tensor3.append(flattened_output[i + 6])
        tensor3.append(flattened_output[i + 7])
    codes = [
        list_to_torch_tensor(tensor1).to(device),
        list_to_torch_tensor(tensor2).to(device),
        list_to_torch_tensor(tensor3).to(device),
    ]
```
My understanding is that with the delay pattern, during parallel token generation (layers 1-7), higher-level tokens condition on the previous tokens, which matches RVQ modeling. A characteristic of the SNAC codes is that the temporal resolution increases from the encoder output to the decoder input. When I map the reconstruction code above onto the SNAC codebooks, it looks like a preorder traversal; by my understanding (which may be wrong?), index 5 in the second layer would then condition on indices 3 and 4 in the third layer, so would a level-order (layer-by-layer) traversal instead also be reasonable (see the small sketch below)? Also, is splitting into 7 flattened layers rather than the corresponding 3 layers meant to reduce the number of autoregressive steps and lower latency? (Moreover, why choose SNAC in the first place: is it because of its high compression rate, or is SNAC also your own work?)
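To make the ordering concrete, this is what I mean by the two traversals for one group of 7 SNAC codes (layer 1 contributes 1 code, layer 2 contributes 2, layer 3 contributes 4; the labels are mine, not from the repo):

```python
# Ordering implied by the reconstruction code above (preorder over the 1-2-4 tree):
preorder = ["L1", "L2a", "L3a", "L3b", "L2b", "L3c", "L3d"]
# The alternative I am asking about (level order, layer by layer):
level_order = ["L1", "L2a", "L2b", "L3a", "L3b", "L3c", "L3d"]
print(preorder)
print(level_order)
```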
Also, the paper states: "Prior to generating audio tokens, padding with N tokens ensures that the corresponding text tokens are produced first." Does N here refer to 1 or to the average sequence length?
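To make the question concrete, here are the two readings of N I have in mind (a purely illustrative layout, not taken from the repo):

```python
# T = text token, <pad> = audio padding token, A = audio token
text = ["T1", "T2", "T3", "T4"]

# (a) N = 1: the audio stream starts one step after the text stream
audio_a = ["<pad>"] * 1 + ["A1", "A2", "A3"]

# (b) N ~= text length: all text tokens are emitted before any audio token
audio_b = ["<pad>"] * len(text) + ["A1", "A2", "A3"]

print("N = 1:        ", audio_a)
print("N = len(text):", audio_b)
```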
Looking forward to your clarification.
Regarding training: Stage 1 aims to acquire ASR/TTS capabilities. I see that the vocabulary has been expanded with audio tokens, but in this stage the LLM is frozen and only the adapter is trained. When the audio tokens are trained with autoregressive teacher forcing, unlike continuous features, the acoustic representation space of the discrete tokens does not seem to be aligned with the LLM's text space, which confuses me (rough sketch of my understanding below). Additionally, I noticed in the dataset section of the paper that ASR/TTS is represented as A1|T1. Does this mean that both ASR and TTS samples are trained autoregressively on mixed text/audio token sequences, or does it mean ASR -> text tokens and TTS -> audio tokens separately?
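To be explicit about what I mean by "the vocabulary is expanded but only the adapter is trained", here is a rough sketch of my understanding (placeholder backbone, token count, and adapter; none of this is taken from the repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

backbone = "Qwen/Qwen2-0.5B"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(backbone)
tokenizer = AutoTokenizer.from_pretrained(backbone)

# 1) expand the text vocabulary with discrete audio codes
#    (placeholder count: 3 SNAC codebooks x 4096 codes)
tokenizer.add_tokens([f"<audio_{i}>" for i in range(3 * 4096)])
model.resize_token_embeddings(len(tokenizer))

# 2) freeze the LLM backbone
for p in model.parameters():
    p.requires_grad = False

# 3) train only the adapter (a stand-in module here) with teacher forcing
adapter = torch.nn.Linear(model.config.hidden_size, model.config.hidden_size)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```

In this setup it is unclear to me how the newly added audio token embeddings end up aligned with the frozen text embedding space, which is the source of my confusion.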
In Stage 1, the datasets used are relatively clean, such as VCTK and LibriTTS for TTS, as well as the audiobook corpus LibriSpeech. Is the reason for not using data from more complex acoustic scenarios (like GigaSpeech) a concern about its impact on the audio tokens?
Is the timbre in the open-source version derived from instruction-tuned GPT-4o? If one wants to customize the timbre, what suggestions do you have within this framework?
In the paper, the delay pattern enables streaming decoding. However, I failed to find a method like `build_delay_input_ids`. The model also seems to have no TTS adapter? So I'm not sure whether the following is correct: right now the model does plain parallel decoding rather than delayed parallel decoding. (A hypothetical sketch of the shift I mean is below.)
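For reference, `build_delay_input_ids` below is my own hypothetical sketch of the shift I was expecting to find, not a function from this repo; plain parallel decoding would feed the unshifted rows instead.

```python
import torch

def build_delay_input_ids(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Hypothetical delay-pattern shift: row k is delayed by k steps.

    codes: (num_layers, T) tensor of audio codes.
    Returns (num_layers, T + num_layers - 1), where row k is prefixed with
    k pad tokens so that, at a given step, layer k lags the layers above it.
    """
    num_layers, T = codes.shape
    out = torch.full((num_layers, T + num_layers - 1), pad_id, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + T] = codes[k]
    return out

# Plain parallel decoding would use `codes` directly; delayed parallel
# decoding would use the shifted version:
codes = torch.arange(21).reshape(7, 3)
print(build_delay_input_ids(codes, pad_id=-1))
```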