Closed inspirit closed 1 year ago
@inspirit oh hey Eugene! do you mean Act -> Conv -> Act -> Conv? i feel like almost all architectures i've seen have activations following the convolutions, or maybe i misunderstood
could you write out what you are thinking in (pseudo)code?
Let me try) I'm mainly comparing with the EnCodec architecture, which is very similar. Currently the Soundstream encoder has the following sequence:
CausalConv1d(...), # input projection
[
[CausalConv1d(...), ELU(), CausalConv1d(...), ELU()] x 3 # residual blocks
CausalConv1d(...) # downsample
] x N
CausalConv1d(...), # to codebook projection
As you can see, ELU activations are only correctly placed between the residual blocks and before the downsample conv. After the downsample we don't have any activation, and there is no activation after the input projection or before the codebook projection. If we look at the EnCodec encoder/decoder blocks, they have an activation between all convolutional blocks:
Conv1d(...), # input projection
[
[ELU(), Conv1d(...), ELU(), Conv1d(...)] x N # residual blocks
ELU(),
Conv1d(...) # downsample
] x N
ELU(),
Conv1d(...), # to codebook projection
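To make the difference between the two listings concrete, here is a minimal sketch that writes each ordering out as a flat list of layer names and reports which convolutions are not directly preceded by an ELU. The function names, the string labels, and the default stage counts are made up for illustration, and residual skip connections are ignored; this is not code from either repository.

```python
def soundstream_style(num_stages=2, num_res=3):
    """Flat layer order from the first listing (conv first, activation after)."""
    layers = ["conv:input_proj"]
    for _ in range(num_stages):
        for _ in range(num_res):
            layers += ["conv:res", "elu", "conv:res", "elu"]
        layers.append("conv:downsample")
    layers.append("conv:codebook_proj")
    return layers


def encodec_style(num_stages=2, num_res=3):
    """Flat layer order from the second listing (activation before every conv)."""
    layers = ["conv:input_proj"]
    for _ in range(num_stages):
        for _ in range(num_res):
            layers += ["elu", "conv:res", "elu", "conv:res"]
        layers += ["elu", "conv:downsample"]
    layers += ["elu", "conv:codebook_proj"]
    return layers


def convs_without_preceding_activation(layers):
    """Names of convs (excluding the very first layer) whose previous layer is not an ELU."""
    return [
        name
        for prev, name in zip(layers, layers[1:])
        if name.startswith("conv") and prev != "elu"
    ]


if __name__ == "__main__":
    print(convs_without_preceding_activation(soundstream_style()))
    print(convs_without_preceding_activation(encodec_style()))
```

With the first ordering, the convs that follow the input projection and each downsample, plus the codebook projection, have no activation in front of them; with the second ordering the list is empty.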
Additionally, I wonder if the dilation factors should be reversed in the decoder block? We have 1-3-9 dilation in the encoder, so I assume in the decoder it should be reversed to 9-3-1. Though it looks like this is not common practice; most implementations keep the order of dilations the same between encoder and decoder.
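As a rough sketch of what reversing would mean: the helper below optionally flips the encoder dilation order for the decoder, and a second helper computes the receptive field of a stack of stride-1 dilated convolutions. The helper names and the kernel size of 7 are assumptions for illustration, not taken from either codebase.

```python
ENCODER_DILATIONS = (1, 3, 9)


def decoder_dilations(encoder_dilations, reverse=False):
    """Return the dilation order for the decoder, optionally reversed."""
    if reverse:
        return tuple(reversed(encoder_dilations))
    return tuple(encoder_dilations)


def receptive_field(dilations, kernel_size=7):
    """Receptive field of stacked stride-1 dilated convs (kernel_size is an assumption)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    same = decoder_dilations(ENCODER_DILATIONS)
    flipped = decoder_dilations(ENCODER_DILATIONS, reverse=True)
    print(same, receptive_field(same))
    print(flipped, receptive_field(flipped))
```

The total receptive field of the stack is the same in either order, since it only depends on the sum of the dilations; what changes is which scale each layer in the stack sees first.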
@inspirit thanks for finding that! this is the first time i've seen it
in the Soundstream paper, they were unclear about where the activations were placed. if the Meta folks decided on using that configuration and got the results they got, let's go for that as well!
re: dilation, i'm not really sure! did EnCodec also use that?
nope, dilation is just a random question that came to mind. I think I have seen some enc/dec implementations reversing it, but it's definitely not common :)
@inspirit ok, i've made it configurable :smile: thanks Eugene!
Hi Phil, I was wondering if ELU activations should be moved up before Conv? considering final encoder/decoder compositions: