Closed yzmyyff closed 1 year ago
@yzmyyff hey Kuang! thank you for the PR
so if you turn back the clocks, I had it this way at an earlier commit. however, I don't think this makes sense with relative positional encoding, not without introducing more complexity
that's why it got switched over to adaptive normalization methods, which a number of text to image works have used successfully
@yzmyyff the reason is because there should be no relative distance to the time token
Thanks for your reply. You are right about sin pos embedding. But the difference between the current implementation and the original paper is a bit confusing, have you considered adding an implementation that is the same as the paper? The optimized implementation is specified by extra parameters, or something else called VoiceBox 2
@yzmyyff i don't think it is correct, so i'm just going to make a note of the differences in the readme (mainly the choice of rotary embeddings, as well as adaptive layernorm for conditioning on time)
implement time embedding according to the paper.
The implementation of ODE time embedding is in section 3.3. We should concatenate X_c and time embedding x_t along the time dimension.