lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

SoundStream and Transformer num units #110

Closed LWprogramming closed 1 year ago

LWprogramming commented 1 year ago

Currently SoundStream has rq_num_quantizers = 8. Should this be 12 instead, or is there something about the bitrate that differs from how the paper handles things?

Also, should Transformer have 16 heads instead of 8 by default?

(based on what I read in the original paper :) )

lucidrains commented 1 year ago

@LWprogramming yea, i think i should extend the SoundStream classes for some proper default

and yea, that sounds about right for the Transformer, i chose ones that i can run on my machine

lucidrains commented 1 year ago

@LWprogramming does MusicLM use a SoundStream with a greater number of quantizers?

LWprogramming commented 1 year ago

Ah yeah, SoundStream in MusicLM is also 12 quantizers (page 4, left column, 3rd paragraph here)

lucidrains commented 1 year ago

@LWprogramming nice! have it as a default on MusicLMSoundStream https://github.com/lucidrains/audiolm-pytorch/blob/main/audiolm_pytorch/soundstream.py#L733

LWprogramming commented 1 year ago

Hmm, looking at the AudioLMSoundStream right next to it, shouldn't rq_num_quantizers also be 12 there? My understanding is that the number of quantizers affects the bitrate of the embeddings specifically but is unrelated to target_sample_hz, so we don't need to reduce the number of quantizers there. (Also, if this is incorrect, then the default SoundStream implementation should have compatible rq_num_quantizers and target_sample_hz, right?)

lucidrains commented 1 year ago

@LWprogramming i'm not sure actually; did you see 12 in the Audio LM paper? or the original soundstream paper

LWprogramming commented 1 year ago

I just re-skimmed the SoundStream paper and it seems like they tested various arrangements of quantizers, but AudioLM specifically says 12 (page 6, left column, 2nd paragraph)

sohananisetty commented 1 year ago

The original SoundStream paper has 8 quantizers. The formula to determine the number of quantizers is:

$$N_q = \frac{r}{\log_2 N}$$

Where $N_q$ is the number of quantizers, $r$ is the number of bits for each output frame, and $N$ is the codebook size. With sampling rate = 24 kHz, bitrate = 6 kbps, $N = 1024$, and strides (3, 4, 5, 8) (this implementation), we get $r = 120$ and $N_q = 12$. With strides (2, 4, 5, 8) we get $r = 80$ and $N_q = 8$ (the SoundStream paper).

$$r = \frac{\text{bitrate}}{\text{sampling rate} / \text{product of strides}}$$
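The arithmetic in this comment can be checked with a short script. This is only a sketch of the two formulas as stated here; the function names `bits_per_frame` and `num_quantizers` are made up for illustration and are not part of audiolm-pytorch.

```python
import math

def bits_per_frame(bitrate_bps, sample_rate_hz, strides):
    """r = bitrate / (sampling rate / product of strides)."""
    # The encoder emits one frame per `product of strides` input samples.
    frame_rate_hz = sample_rate_hz / math.prod(strides)
    return bitrate_bps / frame_rate_hz

def num_quantizers(bitrate_bps, sample_rate_hz, strides, codebook_size):
    """N_q = r / log2(N), where N is the codebook size."""
    r = bits_per_frame(bitrate_bps, sample_rate_hz, strides)
    return r / math.log2(codebook_size)

# 24 kHz audio, 6 kbps target, codebook size 1024 (10 bits per quantizer)
print(bits_per_frame(6000, 24000, (3, 4, 5, 8)))        # 120.0
print(num_quantizers(6000, 24000, (3, 4, 5, 8), 1024))  # 12.0
print(bits_per_frame(6000, 24000, (2, 4, 5, 8)))        # 80.0
print(num_quantizers(6000, 24000, (2, 4, 5, 8), 1024))  # 8.0
```

The only thing that changes between the two cases is the first stride (3 vs 2), which moves the frame rate from 50 Hz to 75 Hz and therefore the bits available per frame at a fixed 6 kbps.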

lucidrains commented 1 year ago

@sohananisetty @LWprogramming ok, thank you both for looking up these details!

i've decided to default to 8 quantizers, as well as the (2, 4, 5, 8) strides with 16 kHz, for the soundstream paper

but i have also fixed AudioLMSoundStream to have 12 quantizers