lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

Soundstream Last Residual Block Activation Shape #146

Closed pranavmalikk closed 1 year ago

pranavmalikk commented 1 year ago

I was going through the SoundStream implementation and verifying the tensor shapes, and I ran into a problem when calculating T/(H · 2^3) × F/2^6. This is the shape at the output of the last residual block, which I expected to be 128, yet I'm getting ([8, 256, 65, 2]). I'm confused by the terminology in the paper: "At the output of the last residual block, the activations have shape T/(H · 2^3) × F/2^6, where T is the number of samples in the time domain and F = W/2 is the number of frequency bins."

For my audio sample:

T = number of samples in the time domain = 513 (height dimension)
H = hop length = 256
W = window length = 1024
F = W/2 = 1024/2 = 512 (number of frequency bins)

I believe this means that either the channels (256) or the height (65) should be 128 (the target activation size).
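
As a sanity check, here is a minimal sketch that plugs the values from this post into the expression quoted from the paper. The function name `expected_activation_shape` and its parameter names are hypothetical and not part of audiolm-pytorch; the code only evaluates the arithmetic T/(H · 2^3) × F/2^6 with F = W/2 and says nothing about what the implementation actually does.

```python
def expected_activation_shape(num_samples, hop_length, window_length):
    """Evaluate T / (H * 2^3) x F / 2^6 with F = W / 2 (expression quoted from the paper).

    Hypothetical helper for illustration only; not an audiolm-pytorch API.
    """
    freq_bins = window_length // 2                   # F = W / 2
    time_dim = num_samples / (hop_length * 2 ** 3)   # T / (H * 2^3)
    freq_dim = freq_bins / 2 ** 6                    # F / 2^6
    return time_dim, freq_dim


# Values taken from this post: T = 513, H = 256, W = 1024
time_dim, freq_dim = expected_activation_shape(513, 256, 1024)
print(f"time term: {time_dim}, frequency term: {freq_dim}")
```

With these inputs the frequency term comes out to 8 and the time term is below 1, so neither equals 128, which is part of my confusion about what T refers to here.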