ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

If I change the input representation, can I still calculate loss the "normal" way from raw audio input? #272

Open Sciss opened 7 years ago

Sciss commented 7 years ago

So I hope it's OK that I use GH issues to bring up ideas...

I was thinking about the long-term structural components of sounds that are more varied than speech. I understand the idea of dilation and the receptive field. Still, I wonder if there are unexplored tricks to improve convergence by at least an order of magnitude.

So my idea is to explore, for example, a pyramidal decomposition of the input signal using a wavelet transform, e.g. with a Haar filter. This would give a mix between a time-domain signal (as it is used now) and a frequency-domain signal. Instead of feeding a sequence of single samples to the input layer, I would feed vectors in the wavelet domain, simply using nearest-neighbour upsampling for the decimated lower-frequency bands. Say I want to represent at least a dozen seconds; at a music-quality sample rate of 44.1 kHz, that's 529,200 sample frames, or 19 levels of wavelet decimation. So I would do away with the 256-channel mu-law one-hot encoding and go to a 19-channel scalar input.
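To make that concrete, here is a rough sketch of what I mean (purely illustrative; `haar_pyramid_features` is a made-up helper, not anything from this repo), assuming a plain Haar analysis filter and nearest-neighbour upsampling of each decimated band back to the original length:

```python
import numpy as np

def haar_pyramid_features(x, levels=19):
    """Illustrative sketch: decompose a 1-D signal into `levels` Haar detail
    bands, then nearest-neighbour-upsample each band back to the original
    length, yielding a (len(x), levels) feature matrix instead of a
    256-channel one-hot input."""
    n = len(x)
    bands = []
    approx = x.astype(np.float64)
    for _ in range(levels):
        # pad to even length so samples can be paired
        if len(approx) % 2:
            approx = np.append(approx, approx[-1])
        detail = (approx[0::2] - approx[1::2]) / np.sqrt(2.0)  # Haar high-pass
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2.0)  # Haar low-pass
        # nearest-neighbour upsampling back to the original length
        factor = int(np.ceil(n / len(detail)))
        bands.append(np.repeat(detail, factor)[:n])
    return np.stack(bands, axis=1)  # shape: (n, levels)

# 12 s at 44.1 kHz -> 529,200 frames, 19 decimation levels
signal = np.random.randn(529200).astype(np.float32)
features = haar_pyramid_features(signal, levels=19)
print(features.shape)  # (529200, 19)
```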

Now, my idea is that this could condition the network to better model long-term envelopes etc. But I would still interpret the output layer as producing raw time-domain audio samples. That is, the loss function would still be based on the raw time-domain input signal, and would not use the wavelet-decimated input at all. Does that sound reasonable?
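For the loss side, a minimal sketch of what I have in mind, written in the TF1 graph style this repo uses; the placeholders and shapes are just stand-ins, and the mu-law encoder here is the standard formula rather than the repo's implementation:

```python
import numpy as np
import tensorflow as tf

def mu_law_encode(audio, quantization_channels=256):
    """Standard mu-law companding + quantization to integer bins (numpy)."""
    mu = float(quantization_channels - 1)
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((np.sign(audio) * magnitude + 1) / 2 * mu + 0.5).astype(np.int64)

# Hypothetical shapes: the network would consume the 19-channel wavelet
# features, but the target stays the mu-law-quantized next raw sample,
# so the 256-way softmax output and the cross-entropy loss are untouched.
time_steps, wavelet_channels, q = 529200, 19, 256

wavelet_input = tf.placeholder(tf.float32, [1, time_steps, wavelet_channels])
logits = tf.placeholder(tf.float32, [1, time_steps - 1, q])  # stand-in for the network output
targets = tf.placeholder(tf.int64, [1, time_steps - 1])      # mu_law_encode(raw_audio)[1:]

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.reshape(targets, [-1]),
        logits=tf.reshape(logits, [-1, q])))
```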

The original paper seems to suggest that we have a handful of controls, all of which can contribute to the same goal (long term memory):

2.6 Context Stacks

... several different ways to increase the receptive field size of a WaveNet: increasing the number of dilation stages, using more layers, larger filters, greater dilation factors, or a combination thereof.

lemonzi commented 7 years ago

Hi @Sciss, of course! :) I hear a lot of science is happening on Twitter nowadays, though.

WaveNet is an autoregressive model; although it has access to a wide range of past samples to generate the current sample (unlike recurrent networks, which only have access to the "input conditioning" and the last internal state), samples are still generated with a "feedback loop". This means the input has to match the output (either a scalar value or a probability distribution), because the output becomes part of the input for the next sample. You can do any pre-processing, such as embedding a local or global condition into the input, but this pre-processing has to be causal, which frequency-domain transforms are not. The model as it is implemented now further requires that the pre-processing is point-wise, applied sample by sample.
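A toy sketch of that feedback loop (the `predict_next` callable just stands in for one forward pass of the network; nothing here is repo code):

```python
import numpy as np

def generate(predict_next, seed, n_samples):
    """Toy autoregressive loop: each output is appended to the input buffer,
    so the output representation must also be a valid input, and any
    per-sample pre-processing may only look at past samples (causality)."""
    buffer = list(seed)
    for _ in range(n_samples):
        sample = predict_next(np.asarray(buffer))  # e.g. sample from the 256-way softmax
        buffer.append(sample)                      # output becomes part of the next input
    return np.asarray(buffer[len(seed):])

# Dummy predictor, just to show the mechanics.
generated = generate(lambda past: 0.99 * past[-1], seed=[1.0], n_samples=5)
print(generated)
```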

What some people have done is use a time-frequency distribution as both input and output, and then convert to and from audio by external means (for example, the short-time spectral magnitude: an STFT to obtain the input, and the Griffin-Lim algorithm to reconstruct the phase of the output). Others have even trained two networks: one that learns how to predict the spectrum, and another that is conditioned on the spectrum and predicts the raw audio. That would be the easy way to go if you want to incorporate wavelets.
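For illustration, the STFT-magnitude route could look roughly like this (this uses librosa and soundfile, which are not dependencies of this repo, and `example.wav` is just a placeholder file):

```python
import librosa
import numpy as np
import soundfile as sf

# Take the magnitude spectrogram as the model's input/output representation,
# then reconstruct a waveform with Griffin-Lim. In the two-network setup,
# `predicted_magnitude` would come from the first network instead of being
# copied from the original audio as it is here.
y, sr = librosa.load("example.wav", sr=16000)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

predicted_magnitude = magnitude  # placeholder for the model's prediction
reconstructed = librosa.griffinlim(predicted_magnitude,
                                   n_iter=60, hop_length=256, win_length=1024)
sf.write("reconstructed.wav", reconstructed, sr)
```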

These memory controls are mostly hyperparameters in wavenet_params.json, although the filter width ("larger filters") is fixed to 2 (https://github.com/ibab/tensorflow-wavenet/blob/master/wavenet/model.py#L340). However, to me, the term Long-Term Memory suggests that the model can "remember" things by storing them and recalling them at a later stage. The WaveNet will always have a limited working memory, although its dilated convolutions (that's its true strength) allow that memory to be very large by not including redundant information and by having separate "slots" for different time scales.
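For reference, you can compute the receptive field those hyperparameters buy directly; a small sketch (treat the dilation list as illustrative, in the style of the default wavenet_params.json):

```python
def receptive_field(filter_width, dilations):
    """Receptive field of a stack of dilated causal convolutions:
    each layer adds (filter_width - 1) * dilation samples of context."""
    return (filter_width - 1) * sum(dilations) + filter_width

# Illustrative dilation pattern: 10 doubling stages (1..512), repeated
# 5 times, with filter_width = 2.
dilations = [2 ** i for i in range(10)] * 5
print(receptive_field(2, dilations))  # 5117 samples, roughly 0.32 s at 16 kHz
```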