About concatenation VS addition

juliakorovsky commented 1 week ago

In the paper they say: "The input to the mel spectrogram generator is m ⊙ ŝ, ŷ, the flow step t, and noisy speech s_t. ŷ is first converted to character embedding sequence. Then, m ⊙ ŝ, s_t, ỹ are all stacked to form a tensor with a shape of (2 · D + E) × T, followed by a linear layer to output a tensor with a shape of D × T." In your code I found this line: https://github.com/lucidrains/e2-tts-pytorch/blob/9d5fc1b4fe6e0fecd0e5e43681be0c6d2d1732ec/e2_tts_pytorch/e2_tts.py#L796

Do I understand correctly that the paper describes concatenation, while your code uses addition? I wasn't able to find where the dimension would be (2 · D + E) × T in your code.
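
To make the shapes in question concrete, here is a minimal sketch of the two options. It uses plain Python lists so it runs without PyTorch. The sizes `D`, `E`, `T`, the helper functions, and all weight names are hypothetical and not taken from the repository. The additive variant shown is only one way it could look.

```python
import random

D, E, T = 4, 3, 5  # hypothetical sizes: mel dim, text-embedding dim, frames
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Return a rows x cols matrix (list of rows) of random floats."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    """Return (rows, cols) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

# Feature-major layout as in the quoted passage: each input is (features x T).
masked_audio = rand_matrix(D, T)  # masked conditioning audio, D x T
noisy_audio = rand_matrix(D, T)   # noisy speech at flow step t, D x T
text_embed = rand_matrix(E, T)    # character embedding sequence, E x T

# Concatenation: stack along the feature axis (what torch.cat would do on
# that axis), then one linear layer maps 2D + E features down to D.
stacked = masked_audio + noisy_audio + text_embed  # list concat stacks rows
w_concat = rand_matrix(D, 2 * D + E)
out_concat = matmul(w_concat, stacked)

# Addition (assumed variant): project each input to D on its own,
# then sum the results elementwise.
w_masked = rand_matrix(D, D)
w_noisy = rand_matrix(D, D)
w_text = rand_matrix(D, E)
projected = [
    matmul(w_masked, masked_audio),
    matmul(w_noisy, noisy_audio),
    matmul(w_text, text_embed),
]
out_add = [[sum(vals) for vals in zip(*rows)] for rows in zip(*projected)]

print(shape(stacked), shape(out_concat), shape(out_add))
```

The (2 · D + E) × T tensor only ever exists in the concatenation path; with addition, every intermediate is already D × T.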

lucidrains commented 1 week ago

@juliakorovsky yes that's right, they brought this design over from voicebox

think this is one of those details that matters very little, much like concat vs add for unet skips
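
for what it's worth, a linear layer applied to the stacked input splits into one projection per input, summed. a sketch, assuming the weight W is partitioned column-wise into blocks W_1, W_2 (each D × D) and W_3 (D × E), names introduced here only for illustration:

$$
W \begin{bmatrix} m \odot \hat{s} \\ s_t \\ \tilde{y} \end{bmatrix}
= W_1 \, (m \odot \hat{s}) + W_2 \, s_t + W_3 \, \tilde{y},
\qquad
W = \begin{bmatrix} W_1 & W_2 & W_3 \end{bmatrix}
$$

so concat-then-linear and project-then-add can represent the same map. adding the inputs directly, with no per-input projection, is the restricted case where every block is the identity, which also requires E = D.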

lucidrains commented 1 week ago

i can add the concat way for completeness' sake, let's leave the issue open.

lucidrains commented 1 week ago

@juliakorovsky https://github.com/lucidrains/e2-tts-pytorch/commit/dc3bf8f822428bca728b1f18ad72bb3e1ac1bf2f#diff-aa4e45ab239723e4fe2411800a686d4b7d1a7235044da0fec1ce924182e03aceR707 let us know if you see a big difference using concat

About concatenation VS addition #32