Different Implementation of Diffusion Model

I'm a researcher building a TTS model using diffusion. While looking for an implementation, I found this repo.

According to my understanding of the paper, both processes in the decoder's diffusion model, forward and backward diffusion, are supposed to operate on the latent vector z [which is produced by the encoder part of the U-Net]. However, the repo's implementation seems to differ from this understanding. Could you explain the reasoning behind this difference?

Repository: huawei-noah/Speech-Backbones, issue #35