Hello, I have a question.
The NaturalSpeech3 paper states that "the factorization diffusion model has a total of 60 forward passes." By my count, however, it should be 36: 4 × 2 = 8 for phoneme-level prosody, 4 for duration, and 4 × 2 = 8 for each of the three token sequences (prosody, content, and acoustic details), which gives 8 + 4 + 8 + 8 + 8 = 36 forward passes.
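To make my counting explicit, here is a small Python tally. The per-component values (4 steps, doubled for every component except duration) are only my reading of the paper, not figures I have confirmed, and the variable names are just for illustration.

```python
# Tally of forward passes under my reading of the paper.
# All per-component values below are my assumptions, not confirmed numbers.
STEPS = 4            # assumed diffusion steps per component
PASSES_PER_STEP = 2  # assumed doubling for every component except duration

components = {
    "phoneme-level prosody": STEPS * PASSES_PER_STEP,
    "duration": STEPS,
    "prosody tokens": STEPS * PASSES_PER_STEP,
    "content tokens": STEPS * PASSES_PER_STEP,
    "acoustic-detail tokens": STEPS * PASSES_PER_STEP,
}

total = sum(components.values())
paper_total = 60  # figure quoted from the paper

for name, passes in components.items():
    print(f"{name}: {passes}")
print(f"my total: {total}, paper: {paper_total}, gap: {paper_total - total}")
```

So my count falls 24 passes short of the number in the paper.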
Could anyone explain this discrepancy? Thanks :)
@HeCheng0625