Closed juliakorovsky closed 1 week ago
@juliakorovsky yes that's right, they brought this design over from voicebox
think this is one of those details that matters very little, much like concat vs add for unet skips
i can add the concat way for completeness' sake, let's leave the issue open.
@juliakorovsky https://github.com/lucidrains/e2-tts-pytorch/commit/dc3bf8f822428bca728b1f18ad72bb3e1ac1bf2f#diff-aa4e45ab239723e4fe2411800a686d4b7d1a7235044da0fec1ce924182e03aceR707 let us know if you see a big difference using concat
In the paper they say: "The input to the mel spectrogram generator is m ⊙ ŝ, ŷ, the flow step t, and noisy speech s_t. ŷ is first converted to character embedding sequence. Then, m ⊙ ŝ, s_t, ỹ are all stacked to form a tensor with a shape of (2 · D + E) × T, followed by a linear layer to output a tensor with a shape of D × T." In your code I found this line: https://github.com/lucidrains/e2-tts-pytorch/blob/9d5fc1b4fe6e0fecd0e5e43681be0c6d2d1732ec/e2_tts_pytorch/e2_tts.py#L796
Do I understand it right: they talk about concatenation and you do addition? I wasn't able to find where the dimension would be (2 · D + E) × T in your code.
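For anyone comparing the two variants, here is a minimal NumPy sketch of why concat vs. add may matter less than it looks. It is not taken from the repo: the sizes `D`, `E`, `T` and all variable names are hypothetical, and NumPy stands in for PyTorch only to keep the example dependency-light. It shows that stacking to (2 · D + E) and applying one linear layer gives the same result as projecting each input with its own block of the weight matrix and summing. Whether the additive path in the linked code corresponds to this exactly depends on how each input is projected before the sum, which this sketch does not check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): mel dim D, text embedding dim E, frames T
D, E, T = 4, 3, 5

cond = rng.normal(size=(T, D))   # stands in for m ⊙ ŝ (masked audio condition)
noisy = rng.normal(size=(T, D))  # stands in for s_t (noisy speech)
text = rng.normal(size=(T, E))   # stands in for ỹ (character embeddings)

# Paper-style: stack along the feature axis to (T, 2D + E), then one linear map to D
W = rng.normal(size=(2 * D + E, D))
stacked = np.concatenate([cond, noisy, text], axis=-1)
out_concat = stacked @ W

# Same computation written as "project each input, then add":
# split W row-wise into the block that multiplies each input
W_cond, W_noisy, W_text = np.split(W, [D, 2 * D], axis=0)
out_add = cond @ W_cond + noisy @ W_noisy + text @ W_text

assert stacked.shape == (T, 2 * D + E)
assert np.allclose(out_concat, out_add)
```

The practical difference only shows up when the inputs are summed without separate learned projections, or when later layers treat the two forms differently, so an empirical comparison is still the way to settle it.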