lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

Activation units position #58

Closed (inspirit closed this issue 1 year ago)

inspirit commented 1 year ago

Hi Phil, I was wondering if the ELU activations should be moved up, before the Conv layers, considering the final encoder/decoder composition:

  1. Emb conv from input
  2. NxEncoderBlock [3xResidualUnit + DownsampleConv]. So if we position the activations before the Conv in ResidualUnit, it will follow the pattern [Conv -> Act -> Conv ...]; we will also need to add an additional Act to EncoderBlock: [3xResidualUnit + Act + DownsampleConv] (see the sketch below). https://github.com/lucidrains/audiolm-pytorch/blob/daeedb2798fed2275a8d6a67ecf150fdf2396dbc/audiolm_pytorch/soundstream.py#L225
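For concreteness, here is a minimal PyTorch sketch of the two orderings. The class names (ResidualUnitPostAct, ResidualUnitPreAct) and the causal-conv stand-in are made up for illustration and are not the actual classes in soundstream.py:

import torch
from torch import nn

def causal_conv1d(chan_in, chan_out, kernel_size, dilation = 1):
    # crude causal conv stand-in: pad on the left only, so no future leakage
    pad = dilation * (kernel_size - 1)
    return nn.Sequential(
        nn.ConstantPad1d((pad, 0), 0.),
        nn.Conv1d(chan_in, chan_out, kernel_size, dilation = dilation)
    )

class ResidualUnitPostAct(nn.Module):
    # current ordering: Conv -> ELU -> Conv -> ELU
    def __init__(self, chan, dilation):
        super().__init__()
        self.net = nn.Sequential(
            causal_conv1d(chan, chan, 7, dilation = dilation), nn.ELU(),
            causal_conv1d(chan, chan, 1), nn.ELU()
        )
    def forward(self, x):
        return x + self.net(x)

class ResidualUnitPreAct(nn.Module):
    # proposed ordering: ELU -> Conv -> ELU -> Conv
    def __init__(self, chan, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.ELU(), causal_conv1d(chan, chan, 7, dilation = dilation),
            nn.ELU(), causal_conv1d(chan, chan, 1)
        )
    def forward(self, x):
        return x + self.net(x)

x = torch.randn(1, 32, 320)
assert ResidualUnitPostAct(32, 3)(x).shape == ResidualUnitPreAct(32, 3)(x).shape == x.shape

Both variants keep the residual connection around the two convs; the only difference is whether each conv is preceded or followed by its ELU.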
lucidrains commented 1 year ago

@inspirit oh hey Eugene! do you mean Act -> Conv -> Act -> Conv? i feel like almost all architectures i've seen have activations following the convolutions, or maybe i misunderstood

could you write out what you are thinking in (pseudo)code?

inspirit commented 1 year ago

Let me try :) I'm mainly comparing with the EnCodec architecture, which is very similar. Currently the Soundstream encoder has the following sequence:

CausalConv1d(...), # input projection
    [
        [CausalConv1d(...), ELU(), CausalConv1d(...), ELU()] x 3 # residual blocks
        CausalConv1d(...) # downsample
    ] x N
CausalConv1d(...), # to codebook projection

As you can see, the ELU activations are only correctly placed between the residual blocks and before the downsample, but after the downsample we don't have any activation, and we don't have an activation after the input projection or before the codebook projection. If we look at the EnCodec encoder/decoder blocks, they have an activation between all convolutional blocks:

Conv1d(...), # input projection
    [
        [ELU(), Conv1d(...), ELU(), Conv1d(...)] x N # residual blocks
        ELU(),
        Conv1d(...) # downsample
    ] x N
ELU(),
Conv1d(...), # to codebook projection
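Translated into a (hypothetical, non-causal) PyTorch sketch, the EnCodec-style placement would look roughly like this; the channel sizes, kernel sizes and strides are made up for illustration and are not the actual SoundStream/EnCodec hyperparameters:

import torch
from torch import nn

class ResidualUnit(nn.Module):
    # pre-activation residual unit: ELU -> Conv -> ELU -> Conv, plus skip connection
    def __init__(self, chan, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.ELU(), nn.Conv1d(chan, chan, 7, dilation = dilation, padding = 3 * dilation),
            nn.ELU(), nn.Conv1d(chan, chan, 1)
        )
    def forward(self, x):
        return x + self.net(x)

def encoder_block(chan_in, chan_out, stride):
    return nn.Sequential(
        *[ResidualUnit(chan_in, d) for d in (1, 3, 9)],
        nn.ELU(),                                                                          # activation before the downsample conv
        nn.Conv1d(chan_in, chan_out, 2 * stride, stride = stride, padding = stride // 2)   # downsample
    )

encoder = nn.Sequential(
    nn.Conv1d(1, 32, 7, padding = 3),    # input projection (no activation before it)
    encoder_block(32, 64, stride = 2),
    encoder_block(64, 128, stride = 4),
    nn.ELU(),                            # activation before the codebook projection
    nn.Conv1d(128, 8, 3, padding = 1)    # to codebook projection
)

x = torch.randn(1, 1, 320)
print(encoder(x).shape)  # torch.Size([1, 8, 40]) -> downsampled by 2 * 4 = 8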
inspirit commented 1 year ago

Additionally, I wonder if the dilation factors should be reversed in the Decoder block? We have 1-3-9 dilation in the encoder, so I assume in the decoder it should be reversed: 9-3-1. Though it doesn't look like common practice; most implementations keep the order of dilations the same between encoder and decoder.
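A minimal sketch of what that reversal would mean (purely illustrative helper, residual connections omitted; not the repo's actual decoder code):

from torch import nn

encoder_dilations = (1, 3, 9)
decoder_dilations = tuple(reversed(encoder_dilations))  # (9, 3, 1)

def dilated_stack(chan, dilations):
    # stack of dilated convs; in the decoder the dilation would shrink
    # 9 -> 3 -> 1, mirroring the encoder's 1 -> 3 -> 9
    return nn.Sequential(*[
        nn.Conv1d(chan, chan, 7, dilation = d, padding = 3 * d) for d in dilations
    ])

decoder_residual_convs = dilated_stack(64, decoder_dilations)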

lucidrains commented 1 year ago

@inspirit thanks for finding that! this is the first time i've seen it

in the Soundstream paper, they were unclear about where the activations were placed. if the Meta folks decided on using that configuration and got the results they got, let's go for that as well!

re: dilation, i'm not really sure! did EnCodec also use that?

inspirit commented 1 year ago

nope, the dilation is just a random question that came to my mind. I think I have seen some enc/dec implementations reversing it, but it's definitely not common :)

lucidrains commented 1 year ago

@inspirit ok, i've made it configurable :smile: thanks Eugene!