[Help]: Total number of forward passes in NaturalSpeech3

Hello, I have a question.

The NaturalSpeech3 paper states that "the factorization diffusion model has a total of 60 forward passes." By my count, however, it should be 4 × 2 for phoneme-level prosody, 4 for duration, and 4 × 2 for each of the prosody, content, and acoustic-detail token sequences. That gives 8 (phoneme-level prosody) + 4 (duration) + 8 (prosody tokens) + 8 (content tokens) + 8 (acoustic-detail tokens) = 36 forward passes.
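To make the tally easy to check, here is a minimal Python sketch of my count. The step counts and the factor of 2 per step are my own reading as described above, not values confirmed by the authors, and the stage names and the `total_forward_passes` helper are just labels I made up for illustration.

```python
# Forward-pass tally based on my reading of the paper (assumed, not confirmed).
# Each entry maps a stage to (diffusion steps, forward passes per step).
stages = {
    "phoneme-level prosody": (4, 2),
    "duration": (4, 1),
    "prosody tokens": (4, 2),
    "content tokens": (4, 2),
    "acoustic-detail tokens": (4, 2),
}


def total_forward_passes(stages):
    """Sum steps * passes-per-step over all stages."""
    return sum(steps * passes for steps, passes in stages.values())


total = total_forward_passes(stages)
print(total)       # 36
print(60 - total)  # 24 passes unaccounted for relative to the paper's figure
```

So under these assumptions there is a gap of 24 forward passes between my count and the 60 reported in the paper.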

Could anyone explain this discrepancy? Thanks :)

@HeCheng0625
