lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch
MIT License

encodec error #158

Closed: syjunghwang closed this 1 year ago

syjunghwang commented 1 year ago

The coarse transformer trains correctly when using SoundStream, but not when using EnCodec.

LWprogramming commented 1 year ago

Hm, are you getting a particular error message for the coarse transformer? As I understand it, by the time we hand a codec to either the coarse or fine transformer, the codec is already pretrained, so the logic with torch.no_grad() is fine.

syjunghwang commented 1 year ago

@LWprogramming One-dimensional waveforms (e.g. shape [96000]) come out of data.py, but if I pass that input to the current EnCodec code, it doesn't run. So wav.unsqueeze(0) has to be inserted at the front of the EnCodec code.
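
For reference, a minimal sketch of the shape issue being described; the tensor shapes and variable names here are illustrative assumptions, not the repo's actual code:

```python
import torch

# data.py yields a raw 1-D waveform, e.g. 96000 samples of mono audio
wav = torch.randn(96000)

# a codec forward pass that expects (batch, samples) fails on a 1-D tensor,
# so a batch dimension is added before handing the waveform to EnCodec
wav_batched = wav.unsqueeze(0)
print(wav_batched.shape)  # torch.Size([1, 96000])
```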

LWprogramming commented 1 year ago

I added an assertion in my personal_hacks branch but was still able to train the coarse transformer correctly. Here is the demo script I am using (it may contain assertion errors I left as reminders to myself; you can ignore those if you want to use it to replicate), and here is the script for specifically my personal_hacks branch.

Is it raising an error when you train using the demo script? If not, it's possible that there's some other processing between when the waveform comes out of data.py and when it actually reaches EnCodec that shapes it correctly (in which case I don't think we'd need the unsqueeze).

syjunghwang commented 1 year ago

Like the demo code you showed me, I confirmed there is no error if I set the data_max_length parameter to 320*30 in the CoarseTransformerTrainer. However, if I instead pass data_max_length_seconds = 30 to the CoarseTransformerTrainer, the following error comes up:

AssertionError: Expected indices to have shape (batch, num_frames, num_coarse_quantizers + num_fine_quantizers), but got torch.Size([1, 2250, 8])
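
For context, a minimal sketch of the kind of shape check that seems to be firing here. The frame rate and quantizer counts below are assumptions for illustration (EnCodec at 24 kHz produces roughly 75 frames per second and 8 codebooks, so 30 s of audio gives indices of shape (1, 2250, 8)), not values read from the repo:

```python
import torch

# assumed transformer configuration, for illustration only
num_coarse_quantizers = 3
num_fine_quantizers = 4

# 30 s of audio at ~75 frames/s with 8 EnCodec codebooks -> indices of shape (1, 2250, 8)
indices = torch.zeros(1, 30 * 75, 8, dtype = torch.long)

expected = num_coarse_quantizers + num_fine_quantizers
print(indices.shape[-1], expected)    # 8 vs 7
print(indices.shape[-1] == expected)  # False, so a check like the reported assertion would fail
```

If the mismatch looks like this, the fix would be making the configured coarse + fine quantizer counts add up to the number of codebooks the codec actually emits.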