lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch
MIT License
2.36k stars 255 forks

coarsetransformer has wrong code? #155

Closed syjunghwang closed 1 year ago

syjunghwang commented 1 year ago

We get `[batch_size, num_quantizers, timesteps]` from Encodec, so I think the code that cuts the Encodec output down to the number of coarse quantizers (=3) should be changed from `coarse_token_ids, fine_token_ids = indices[..., :num_coarse_quantizers], indices[..., num_coarse_quantizers:]` to `coarse_token_ids, fine_token_ids = indices[..., :num_coarse_quantizers, :], indices[..., num_coarse_quantizers:, :]`. Am I mistaken?
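The shape concern above can be illustrated with a small sketch (NumPy standing in for torch here, since the indexing semantics are the same; the shapes and variable names follow the discussion, not the actual repo code):

```python
import numpy as np

num_coarse_quantizers = 3
# Encodec-style codes: (batch, num_quantizers, timesteps)
indices = np.zeros((2, 8, 100), dtype=np.int64)

# Slicing the LAST axis (as the quoted code does) cuts along time,
# leaving all 8 quantizers in the "coarse" part:
coarse_wrong = indices[..., :num_coarse_quantizers]
# coarse_wrong.shape == (2, 8, 3)  -- 3 timesteps, not 3 quantizers

# Slicing the quantizer axis (second-to-last), as suggested,
# gives the intended coarse/fine split for this layout:
coarse_right = indices[..., :num_coarse_quantizers, :]
fine_right = indices[..., num_coarse_quantizers:, :]
# coarse_right.shape == (2, 3, 100); fine_right.shape == (2, 5, 100)
```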

LWprogramming commented 1 year ago

I'll take a closer look at this tomorrow; just noting here that I wrote the Encodec code based on the shapes from the SoundStream code, so Encodec should be consistent with SoundStream at the moment.

syjunghwang commented 1 year ago

In the SoundStream code you mentioned, the indices are shaped `(batch_size, timesteps, num_quantizers (=8))` rather than `(batch_size, num_quantizers (=8), timesteps)`, so I think the output in the Encodec code could be changed to match SoundStream's layout.
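The alternative fix suggested here, transposing the Encodec codes to SoundStream's layout so the existing last-axis slice works unchanged, can be sketched like this (NumPy for illustration; names are hypothetical):

```python
import numpy as np

num_coarse_quantizers = 3
# Encodec-style codes: (batch, num_quantizers, timesteps)
encodec_indices = np.zeros((2, 8, 100), dtype=np.int64)

# Transpose to SoundStream's layout: (batch, timesteps, num_quantizers)
indices = encodec_indices.transpose(0, 2, 1)

# Now the existing last-axis slice splits quantizers, as intended:
coarse = indices[..., :num_coarse_quantizers]  # (2, 100, 3)
fine = indices[..., num_coarse_quantizers:]    # (2, 100, 5)
```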

LWprogramming commented 1 year ago

Wait a minute, I think I figured out the mistake. The original SoundStream code is correct: the problem I had was that `x` is reshaped to batch x codebook size x timesteps, but crucially `indices` is still batch x num_frames x num_quantizers, and each frame corresponds to a single set of ids (one for each quantizer). So the SoundStream code is compatible with the coarse transformer code, and the error is in my Encodec implementation.
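The point above, that the embedding `x` and the `indices` deliberately have different layouts, can be checked with a toy sketch (NumPy for illustration; the dimension sizes are made up):

```python
import numpy as np

batch, num_frames, num_quantizers, codebook_dim = 2, 100, 8, 512

# Per the comment above, the quantized embedding x and the codebook
# indices have DIFFERENT layouts in the SoundStream code:
x = np.zeros((batch, codebook_dim, num_frames))  # batch x codebook size x timesteps
indices = np.zeros((batch, num_frames, num_quantizers), dtype=np.int64)

# Each frame holds one id per quantizer, so a last-axis slice on
# `indices` correctly splits coarse vs fine quantizers:
num_coarse_quantizers = 3
coarse = indices[..., :num_coarse_quantizers]  # (2, 100, 3)
fine = indices[..., num_coarse_quantizers:]    # (2, 100, 5)
```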