Closed: syjunghwang closed this issue 1 year ago
I'll take a closer look at this tomorrow. For now, just noting that I wrote the Encodec code based on the shapes from the Soundstream code, so Encodec should be consistent with Soundstream at the moment.
In the Soundstream code you mentioned, the indices are shaped (batchsize, timestep, quantizers(=8)) rather than (batchsize, quantizers(=8), timestep), so I think we can change the output in the encoder Python code to match the Soundstream layout.
Wait a minute, I think I figured out the mistake. The original Soundstream code is correct. The problem I had was that x is reshaped to batch x codebook size x timesteps, but crucially indices is still batch x num_frames x num_quantizers, and each frame corresponds to a single set of ids (one for each quantizer). So I think the Soundstream code is compatible with the coarse transformer code and the error is in my Encodec implementation.
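To make the shape argument concrete, here is a minimal sketch of the batch x num_frames x num_quantizers layout described above. It uses NumPy as a stand-in for torch tensors (the indexing behaves the same way for this case), and the sizes and id values are made up purely for illustration; none of it is taken from the actual Soundstream or Encodec code.

```python
import numpy as np

# Illustrative sizes only (assumptions, not values from the repo).
batch, num_frames, num_quantizers = 2, 5, 8
num_coarse_quantizers = 3

# Hypothetical token ids: entry [b, t, q] = 10 * t + q, so the last digit
# tells you which quantizer produced the id and the tens digit is the frame.
frame_ids = 10 * np.arange(num_frames)[:, None] + np.arange(num_quantizers)[None, :]
indices = np.broadcast_to(frame_ids, (batch, num_frames, num_quantizers))

# Each frame holds one id per quantizer, so slicing the LAST axis
# separates coarse quantizers from fine quantizers.
coarse = indices[..., :num_coarse_quantizers]
fine = indices[..., num_coarse_quantizers:]

print(indices[0, 1])  # ids of frame 1 in batch 0: [10 11 12 13 14 15 16 17]
print(coarse.shape)   # (2, 5, 3)
print(fine.shape)     # (2, 5, 5)
```

With this layout the slice on the last axis keeps every frame and only drops quantizers, which is what the coarse transformer code appears to expect.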
We get [batchsize, encodec quantizer number, timesteps] from Encodec, so I think the code that cuts the Encodec output down to the number of coarse quantizers (=3) should change from "coarse_tokenids, = indices[..., :num_coarse_quantizers], indices[..., num_coarse_quantizers:]" to "coarse_tokenids, = indices[..., :num_coarse_quantizers, :], indices[..., num_coarse_quantizers:, :]". Am I mistaken?
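For what it's worth, a quick sketch of the two ways this could go, assuming the Encodec output really is [batchsize, quantizers, timesteps]. Again NumPy stands in for torch, and all sizes and ids are invented for the example. Option A is the slice proposed above; option B swaps the last two axes so the existing slice can stay unchanged (in torch that would presumably be something like indices.transpose(1, 2)).

```python
import numpy as np

# Illustrative sizes only (assumptions, not values from the repo).
batch, num_quantizers, timesteps = 2, 8, 5
num_coarse_quantizers = 3

# Hypothetical ids in (batch, quantizers, timesteps) layout: [b, q, t] = 10 * t + q.
q = np.arange(num_quantizers)[:, None]
t = np.arange(timesteps)[None, :]
indices_bqt = np.broadcast_to(10 * t + q, (batch, num_quantizers, timesteps))

# Option A: slice the quantizer axis directly (second-to-last axis).
coarse_a = indices_bqt[..., :num_coarse_quantizers, :]
fine_a = indices_bqt[..., num_coarse_quantizers:, :]

# Option B: move to (batch, timesteps, quantizers), then use the original slice.
indices_btq = np.swapaxes(indices_bqt, 1, 2)
coarse_b = indices_btq[..., :num_coarse_quantizers]

# Pitfall: the original slice applied to the (b, q, t) layout trims
# timesteps instead of quantizers.
wrong = indices_bqt[..., :num_coarse_quantizers]

print(coarse_a.shape)  # (2, 3, 5)
print(fine_a.shape)    # (2, 5, 5)
print(coarse_b.shape)  # (2, 5, 3)
print(wrong.shape)     # (2, 8, 3)
```

Both options select the same ids; they differ only in the axis order of the result, so the choice depends on which layout the downstream coarse transformer code expects.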