@lzl1456 the input needs to have a length that is divisible by the cumulative product of the strides
i curtail it otherwise before computing the reconstruction loss https://github.com/lucidrains/audiolm-pytorch/blob/main/audiolm_pytorch/soundstream.py#L587
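For anyone landing here, a minimal sketch of that curtailment, assuming the default SoundStream strides of (2, 4, 5, 8), whose cumulative product is 320; adjust if you configured the strides differently:

```python
import math
import torch

# assuming the default audiolm-pytorch SoundStream strides of (2, 4, 5, 8);
# their cumulative product is the total downsampling factor of the encoder
strides = (2, 4, 5, 8)
downsample_factor = math.prod(strides)  # 320

audio = torch.randn(1, 225360)

# trim the input to the nearest multiple of the downsample factor,
# mirroring the curtailment done before the reconstruction loss
valid_len = (audio.shape[-1] // downsample_factor) * downsample_factor
audio = audio[..., :valid_len]  # (1, 225280)
```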
thanks, about soundstream: i trained on Libri-Light for 50k steps with data_max_length_seconds = 10

```python
soundstream = SoundStream(
    codebook_size = 1024,
    target_sample_hz = 16000,
    rq_num_quantizers = 12,
    attn_window_size = 128,  # local attention receptive field at bottleneck
    attn_depth = 2           # 2 local attention transformer blocks - the soundstream folks were not experts with attention, so i took the liberty to add some. encodec went with lstms, but attention should be better
).cuda()
```
Do you have any advice for a better training setup? At present I train the model, then compress/encode the audio and reconstruct it directly. Compared with the original audio, the reconstruction error is fairly large, and background noise (it sounds like machinery) is mixed in.
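For reference, a minimal round-trip sketch along the lines of the repo's README; the `return_recons_only` flag exists in recent versions of audiolm-pytorch, but treat the exact call as an assumption for your version:

```python
import torch
import torch.nn.functional as F

# encode and reconstruct with a trained soundstream, then measure the error;
# `soundstream` is the model instantiated above
audio = torch.randn(1, 225280).cuda()  # already a multiple of 320, so nothing is trimmed

soundstream.eval()
with torch.no_grad():
    recons = soundstream(audio, return_recons_only = True)  # (1, 1, 225280)

# drop the channel dimension before comparing against the input
mse = F.mse_loss(recons.squeeze(1), audio)
print(mse.item())
```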
@lzl1456 feel free to chat with other practitioners in the discussion boards
```
input shape  = torch.Size([1, 225360])
output shape = torch.Size([1, 1, 225280])
```
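Those shapes line up with the curtailment above: assuming a total downsampling factor of 320, 225360 = 704 × 320 + 80, so the 80 trailing samples are trimmed and the reconstruction comes back at 704 × 320 = 225280 (with an added channel dimension).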