lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in PyTorch
MIT License

The original is not the same length as the soundstream output #115

Closed lzl1456 closed 1 year ago

lzl1456 commented 1 year ago

input shape = torch.Size([1, 225360])

output shape = torch.Size([1, 1, 225280])

lucidrains commented 1 year ago

@lzl1456 the input needs to have a length that is divisible by the cumulative product of the strides

i curtail it otherwise for the reconstruction loss https://github.com/lucidrains/audiolm-pytorch/blob/main/audiolm_pytorch/soundstream.py#L587
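To make the shapes above concrete: if the encoder's cumulative downsample factor is 320 (the product of the default strides `(2, 4, 5, 8)` in this repo, as I understand them), then an input of length 225360 gets curtailed to the largest multiple of 320, which is 225280. A minimal sketch of that trimming, assuming those strides:

```python
import torch
from functools import reduce

# assumed default encoder strides; their product is the cumulative downsample factor
strides = (2, 4, 5, 8)
downsample_factor = reduce(lambda a, b: a * b, strides)  # 2 * 4 * 5 * 8 = 320

wave = torch.randn(1, 225360)

# curtail to the largest length divisible by the downsample factor
trimmed_len = (wave.shape[-1] // downsample_factor) * downsample_factor
wave = wave[..., :trimmed_len]

print(wave.shape)  # torch.Size([1, 225280])
```

Trimming the input yourself this way (or padding it up to the next multiple) keeps the reconstruction the same length as what you fed in.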

lzl1456 commented 1 year ago

Thanks. About SoundStream: I trained on Libri-Light for 50k steps with `data_max_length_seconds = 10`:

```python
soundstream = SoundStream(
    codebook_size = 1024,
    target_sample_hz = 16000,
    rq_num_quantizers = 12,
    attn_window_size = 128,  # local attention receptive field at bottleneck
    attn_depth = 2           # 2 local attention transformer blocks - the soundstream folks were not experts with attention, so i took the liberty to add some. encodec went with lstms, but attention should be better
).cuda()
```

Do you have a better training setup? At the moment, when I use the trained model to compress/encode audio and reconstruct it directly, the loss relative to the original audio is fairly large, and background noise (sounding like machinery) is mixed in.

lucidrains commented 1 year ago

@lzl1456 feel free to chat with other practitioners in the discussion boards