state-spaces / s4

Structured state space sequence models

Questions about SaShiMi training data format #65

Closed · RoiEXLab closed this issue 2 years ago

RoiEXLab commented 2 years ago

Hello everyone, I'm currently working on a private project where I plan to generate audio using a custom dataset. After a while I realized that inverting the spectrograms I had planned to generate with a GAN wasn't producing the high-fidelity audio I'd hoped for, so I decided to switch to a model that operates on raw audio waveforms instead. First I read about WaveNet, which is probably the go-to paper on this topic, and later I found "It's Raw! Audio Generation with State-Space Models" and the proposed SaShiMi model. I have a mediocre understanding of machine learning, but not enough to understand every little detail of both papers, so apologies if I misunderstood something. Anyway, I was able to generate a couple of sound samples using this repository on my machine, which literally sounded promising (I believe the documentation is not completely up to date, but I got it working regardless). So I'm planning on using the standalone implementation of the SaShiMi model you kindly provided.

However, I realized that I haven't quite understood how the model is supposed to be used in practice. One thing I believe I have understood is that the model is used in "convolutional mode" for training because training can be done in parallel, whereas "recurrent mode" has to be used for audio generation, where one sample is generated at a time. I haven't understood why this CNN representation can't be used for generation as well (but I suspect the answer will be obvious once I understand the rest).

Also the standalone example creates a random tensor:

torch.randn(batch_size, seq_len, dim)

The batch dimension is pretty self-explanatory, and I assume the sequence length dimension is SECONDS * SAMPLE_RATE for the dataset chunks, but I don't know what dim is supposed to represent. I originally thought it might be audio channels, but I soon realized that the datasets seem to be mono audio only (kind of expected, to be honest; most datasets are), and the default dimension size seems to be 64.

So what does it represent? How would I feed the model my 30-minute audio clip, chunked into samples of a couple of seconds each? Also, do you think SaShiMi could be adjusted to work with stereo audio samples? Perhaps by adding some 1-D convolution layers in front of and behind the whole model?

The final question I have is about the output of the model. What does it represent? The paper speaks about probabilities of waveforms, but how can those probabilities be turned back into actual waveforms, i.e. amplitude values of individual samples? Unfortunately I wasn't able to reverse-engineer the reference implementation you're using.

Thanks in advance for your time and for providing an open reference implementation. Sorry if the questions are too basic.

albertfgu commented 2 years ago

I haven't understood why this CNN representation can't be used for generation as well

It can be used for generation, but it will be a lot slower than the RNN mode.
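A toy sketch of the difference (dummy functions, not the repo's code): with the convolutional view, producing each new sample means re-running the model over the entire prefix generated so far, so the total work grows quadratically in the sequence length, while the recurrent view only advances a fixed-size state per sample:

```python
import torch

T, dim = 8, 4  # tiny sizes just to illustrate the loop structure

def conv_forward(prefix):
    """Dummy stand-in for a full parallel forward pass over a length-t prefix."""
    return prefix.mean(dim=0)          # pretend these are the next-sample features

def rnn_step(x, state):
    """Dummy stand-in for one recurrent step that only touches a fixed-size state."""
    state = 0.9 * state + 0.1 * x
    return state, state

# Convolutional mode for generation: O(t) work at step t, O(T^2) overall,
# because every new sample re-processes the whole prefix.
prefix = torch.zeros(1, dim)
for _ in range(T):
    y = conv_forward(prefix)
    prefix = torch.cat([prefix, y.unsqueeze(0)])

# Recurrent mode: O(1) work per step, O(T) overall.
x, state = torch.zeros(dim), torch.zeros(dim)
for _ in range(T):
    x, state = rnn_step(x, state)
```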

I originally thought it might be audio channels, but I soon realized that the datasets seem to be mono audio only (kind of expected, to be honest; most datasets are), and the default dimension size seems to be 64.

You can run one of the full audio training examples (e.g. python -m train experiment=audio/sashimi-sc09) and print or trace through the entire model to understand how it works. Briefly, the model consists of (i) an encoder layer to map the 1-D audio channel to higher dimensions (e.g. 64), (ii) a bunch of repeating blocks that transform sequences of shape (batch, length, dim), (iii) a decoder layer that maps it to the shape the loss function expects. This is very similar to how other deep learning models such as ResNets and Transformers work, so it would be best to understand these standard models first.
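For intuition, here is a minimal shape-flow sketch of that encoder → blocks → decoder structure. The Embedding/Linear layers are hypothetical stand-ins for the repo's actual encoder and S4/Sashimi blocks, and the 256 classes assume 8-bit quantized mono audio:

```python
import torch
import torch.nn as nn

batch, length, dim, n_classes = 2, 16000, 64, 256   # 256 = 8-bit quantized amplitudes

encoder = nn.Embedding(n_classes, dim)     # (B, L) integer samples -> (B, L, dim)
blocks = nn.Sequential(                    # stand-in for the stack of S4/Sashimi blocks;
    nn.Linear(dim, dim), nn.GELU(),        # each block maps (B, L, dim) -> (B, L, dim)
    nn.Linear(dim, dim), nn.GELU(),
)
decoder = nn.Linear(dim, n_classes)        # (B, L, dim) -> (B, L, n_classes) logits

x = torch.randint(0, n_classes, (batch, length))   # one quantized waveform chunk
h = encoder(x)        # (2, 16000, 64)
h = blocks(h)         # (2, 16000, 64)  -- the `dim` the standalone example expects
logits = decoder(h)   # (2, 16000, 256) -- fed to cross-entropy against the next sample
print(h.shape, logits.shape)
```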

Also do you think SaShiMi could be adjusted to work with stereo audio samples?

I imagine there should be simple modifications to handle stereo, but I have no experience with stereo so I can't make concrete suggestions. This sounds like a more general architecture question about deep neural networks; any approach that might work for WaveNet should work for Sashimi as well.

The final question I have is about the output of the model. What does it represent? The paper speaks about probabilities of waveforms, but how can those probabilities be turned back into actual waveforms, i.e. amplitude values of individual samples?

The model tries to model the following: given a sequence of samples $x_1, x_2, \dots, x_t$, guess the distribution of $x_{t+1}$. This can be used to give probabilities of existing waveforms, or generate new waveforms.
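To make the last step concrete, here is a hedged sketch of turning per-step probabilities back into amplitude values. `model_step` is a hypothetical placeholder for one recurrent step of a trained model, and 8-bit mu-law quantization (as in WaveNet-style setups) is an assumption:

```python
import torch

n_classes = 256   # assumed 8-bit quantization of the waveform

def model_step(prev_sample):
    """Placeholder for one recurrent step: logits over the next sample's 256 classes."""
    return torch.randn(prev_sample.shape[0], n_classes)

def mu_law_decode(q, mu=255):
    """Map integer classes in [0, 255] back to amplitudes in [-1, 1]."""
    x = 2 * q.float() / mu - 1
    return torch.sign(x) * ((1 + mu) ** x.abs() - 1) / mu

samples = [torch.zeros(1, dtype=torch.long)]          # start from silence
for _ in range(16000):                                # one second at 16 kHz
    logits = model_step(samples[-1])                  # distribution over x_{t+1}
    probs = torch.softmax(logits, dim=-1)
    samples.append(torch.multinomial(probs, 1).squeeze(-1))   # sample the next class

waveform = mu_law_decode(torch.stack(samples[1:], dim=-1))    # (1, 16000) floats in [-1, 1]
```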

Most of these questions are not about Sashimi/S4 but about deep autoregressive models more broadly. Rather than reading the latest papers, I typically find it more helpful to read blog posts or lecture notes for background before getting into specific models like Sashimi. An example is Stanford's generative modeling course, but there are many other good resources.

RoiEXLab commented 2 years ago

@albertfgu Thank you very much for your patience and the insight!