Closed by karynaur 7 hours ago
Yes, this is expected: you need to pad the last frame so that its length is a multiple of 1920 samples (the frame size of Mimi). We cannot do this automatically, because we can never know whether a given frame is the last one or not!
I have clarified this point in the code snippet at https://github.com/kyutai-labs/moshi/tree/main/moshi#api
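For anyone hitting the same mismatch, here is a minimal sketch of the arithmetic behind that padding. It only assumes the frame size of 1920 samples quoted above; the helper names `padded_length` and `pad_amount` are hypothetical and are not part of the moshi API.

```python
FRAME_SIZE = 1920  # Mimi frame size in samples, as quoted in the reply above


def padded_length(num_samples: int, frame_size: int = FRAME_SIZE) -> int:
    """Return the smallest multiple of frame_size that is >= num_samples."""
    num_frames = (num_samples + frame_size - 1) // frame_size  # ceiling division
    return num_frames * frame_size


def pad_amount(num_samples: int, frame_size: int = FRAME_SIZE) -> int:
    """Return how many samples must be appended to reach a whole number of frames."""
    return padded_length(num_samples, frame_size) - num_samples


# The two input lengths reported in this issue:
print(padded_length(240000), pad_amount(240000))  # already a whole number of frames
print(padded_length(194882), pad_amount(194882))  # rounded up to the next frame
```

With the lengths from this issue, 240000 is exactly 125 frames, so nothing is added, while 194882 rounds up to 102 frames, i.e. 195840 samples, which matches the reported decoded.shape. Assuming the waveform is a tensor shaped like the wav in this issue, the padding could be applied with torch.nn.functional.pad(wav, (0, pad_amount(wav.shape[-1]))), and the decoded audio could then be trimmed back to the original length if needed.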
Backend impacted
The PyTorch implementation
Operating system
Linux
Hardware
CPU
Description
The output for this ^ is:
decoded.shape, wav.shape = (torch.Size([1, 1, 240000]), torch.Size([1, 1, 240000]))
Works perfectly!
Output:
decoded.shape, wav.shape = (torch.Size([1, 1, 195840]), torch.Size([1, 1, 194882]))
Why the difference in sizes, and what is the extra information in decoded?
Extra information
Installed with
pip install moshi
Environment
Fill in the following information on your system.
If the backend impacted is PyTorch:
CUDA version (run python -c 'import torch; print(torch.version.cuda)'): 12.1
If the backend is MLX: