Closed LWprogramming closed 1 year ago
@LWprogramming yea, i think i should extend the `SoundStream` classes for some proper defaults
and yea, that sounds about right for the `Transformer`, i chose ones that i can run on my machine
@LWprogramming does MusicLM use a SoundStream with a greater number of quantizers?
Ah yeah, SoundStream in MusicLM also uses 12 quantizers (page 4, left column, 3rd paragraph of the paper)
@LWprogramming nice! have it as a default on `MusicLMSoundStream`: https://github.com/lucidrains/audiolm-pytorch/blob/main/audiolm_pytorch/soundstream.py#L733
Hmm, looking at the `AudioLMSoundStream` right next to it, shouldn't `rq_num_quantizers` also be 12 there? My understanding is that the number of quantizers affects the bitrate of the embeddings specifically but is unrelated to `target_sample_hz`, so we don't reduce quantizers there. (Also, if this is incorrect, then the default `SoundStream` implementation should have `rq_num_quantizers` and `target_sample_hz` compatible, right?)
@LWprogramming i'm not sure actually; did you see 12 in the AudioLM paper? or the original soundstream paper?
I just re-skimmed the soundstream paper and it seems like they tested various arrangements of quantizers, but AudioLM says specifically 12 (page 6, left column, 2nd paragraph)
The original SoundStream paper has 8 quantizers. The formula to determine the number of quantizers is:
$$N_q = \frac{r}{\log_2 N}$$
where $N_q$ is the number of quantizers, $r$ is the number of bits for each output frame, and $N$ is the codebook size. The bits per frame follow from the bitrate and the frame rate:
$$r = \frac{\text{bitrate}}{\text{sampling rate} / \prod \text{strides}}$$
With sampling rate = 24 kHz, bitrate = 6 kbps, $N = 1024$, and strides (3, 4, 5, 8) (this implementation), we get $r = 120$ and $N_q = 12$. With strides (2, 4, 5, 8) we get $r = 80$ and $N_q = 8$ (the SoundStream paper).
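For anyone who wants to check these numbers, here is a minimal sketch of the calculation in plain Python. The function name `num_quantizers` and its parameter names are made up for illustration and are not part of audiolm-pytorch.

```python
import math


def num_quantizers(sample_rate_hz, bitrate_bps, strides, codebook_size):
    """Hypothetical helper: quantizers needed to hit a target bitrate."""
    # The encoder downsamples by the product of its strides,
    # so this is the number of output frames per second.
    frame_rate_hz = sample_rate_hz / math.prod(strides)
    # r: bits available for each output frame.
    bits_per_frame = bitrate_bps / frame_rate_hz
    # Each quantizer contributes log2(N) bits per frame.
    bits_per_quantizer = math.log2(codebook_size)
    return bits_per_frame / bits_per_quantizer


print(num_quantizers(24_000, 6_000, (3, 4, 5, 8), 1024))  # 12.0
print(num_quantizers(24_000, 6_000, (2, 4, 5, 8), 1024))  # 8.0
```

The result is a float, so a combination of settings that does not divide evenly would show up as a non-integer quantizer count.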
@sohananisetty @LWprogramming ok, thank you both for looking up these details!
i've decided to default to 8 quantizers, as well as the (2, 4, 5, 8) strides with 16 kHz for the soundstream paper
but i have also fixed `AudioLMSoundStream` to have 12 quantizers
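As a rough sanity check on these defaults, the same relation can be run in the other direction to see what bitrate a given quantizer count implies. This is only a sketch: `implied_bitrate_bps` is a hypothetical helper, the codebook size of 1024 is assumed from the earlier comment, and the 12-quantizer line assumes the same strides and sample rate, which the thread does not confirm for `AudioLMSoundStream`.

```python
import math


def implied_bitrate_bps(sample_rate_hz, strides, num_quantizers, codebook_size):
    """Hypothetical helper: bitrate implied by a quantizer configuration."""
    # Output frames per second after strided downsampling.
    frames_per_second = sample_rate_hz / math.prod(strides)
    # Each quantizer emits one codebook index (log2(N) bits) per frame.
    bits_per_frame = num_quantizers * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# Defaults described above: 16 kHz, strides (2, 4, 5, 8), 8 quantizers.
print(implied_bitrate_bps(16_000, (2, 4, 5, 8), 8, 1024))   # 4000.0
# Assumption: same strides and sample rate, but 12 quantizers.
print(implied_bitrate_bps(16_000, (2, 4, 5, 8), 12, 1024))  # 6000.0
```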
Currently `SoundStream` has `rq_num_quantizers = 8`. Should this be 12 instead, or is there something about the bitrate that's different from how the paper handles things? Also, should `Transformer` have 16 heads instead of 8 by default?
(based on what I read in the original paper :) )