lucidrains / voicebox-pytorch

Implementation of Voicebox, new SOTA Text-to-speech network from MetaAI, in Pytorch
MIT License
589 stars 49 forks source link

Impl time embedding #9

Closed yzmyyff closed 1 year ago

yzmyyff commented 1 year ago

implement time embedding according to the paper.

The implementation of ODE time embedding is in section 3.3. We should concatenate X_c and time embedding x_t along the time dimension.

lucidrains commented 1 year ago

@yzmyyff hey Kuang! thank you for the PR

so if you turn back the clocks, I had it this way at an earlier commit. however, I don't think this makes sense with relative positional encoding, not without introducing more complexity

that's why it got switched over to adaptive normalization methods, which a number of text to image works have used successfully

lucidrains commented 1 year ago

@yzmyyff the reason is because there should be no relative distance to the time token

yzmyyff commented 1 year ago

Thanks for your reply. You are right about sin pos embedding. But the difference between the current implementation and the original paper is a bit confusing, have you considered adding an implementation that is the same as the paper? The optimized implementation is specified by extra parameters, or something else called VoiceBox 2

lucidrains commented 1 year ago

@yzmyyff i don't think it is correct, so i'm just going to make a note of the differences in the readme (mainly the choice of rotary embeddings, as well as adaptive layernorm for conditioning on time)