ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

Stereo Implementation #323

Open pullahs opened 6 years ago

pullahs commented 6 years ago

Hey there,

I'm doing my master's degree in Sonic Arts and trying to come up with a good thesis on applications of machine learning to audio. I've been playing around with Magenta and WaveNet (thanks to you people!). Magenta is interesting as well, but synthesizing sound from scratch, sample by sample, is super interesting and revolutionary in my opinion. So I decided to put my effort into this one, and I've been trying to come up with some nice ideas lately.

I'll try to explain some of what I've done so far.

Training the model with Impulse Response (IR) files:

I thought it would be a nice experiment to train the model on a dataset of reverb impulse responses and have it synthesize some arbitrary spaces for me. So I prepared a folder containing 115 IR samples of real spaces such as a tunnel, a hall, a convention center, etc. After training the model up to 20,000 steps with the default settings, I generated 20 seconds of IR audio. This is what it gave me, which is quite interesting actually (it worked...): https://soundcloud.com/pullahs/generated-irs/s-WQW2v?in=pullahs/sets/wavenet/s-sx05a

I then sliced the generated wave file, extracted some IRs out of it, and loaded them into a convolution reverb plugin (that's how you use IRs) to hear the resulting spaces. Here is a playlist: https://soundcloud.com/pullahs/sets/wavenet/s-sx05a

What I learned from this is that it promises a lot. With better conditioning and a more deliberate design approach in the code (which I'm super bad at...), this could lead to something much more useful: designing auditory spaces. The downside of my experiments is that the model only works in mono; to be really useful it should work in stereo (at least). On the "how could it be done (audio-wise)" side of things, the only idea that comes to mind is that while training, the model should also account for the decorrelation between the left and right channels, i.e. the phase differences between the samples of the two channels. (A rough sketch of how one might measure that follows.)
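For illustration only, here is a minimal NumPy sketch of two such measurements, zero-lag correlation and the dominant inter-channel lag; the function name and the use of `scipy.io.wavfile` are my own assumptions, not anything from this repo:

```python
import numpy as np
from scipy.io import wavfile

def interchannel_stats(path):
    """Rough stereo decorrelation measures for a 2-channel wav file."""
    fs, audio = wavfile.read(path)            # audio: [samples, 2]
    left = audio[:, 0].astype(np.float64)
    right = audio[:, 1].astype(np.float64)

    # Zero-lag normalized correlation: 1.0 = identical channels,
    # near 0 = decorrelated.
    corr = np.corrcoef(left, right)[0, 1]

    # Dominant lag between the channels via full cross-correlation
    # (O(n^2); fine for short IRs, use FFT-based correlation otherwise).
    xcorr = np.correlate(left, right, mode='full')
    lag = np.argmax(np.abs(xcorr)) - (len(left) - 1)
    return corr, lag / float(fs)              # correlation, lag in seconds
```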

If a stereo implementation were done, or if anyone would like to guide me on it, I have a lot of things to try out.

Best.

ljuvela commented 6 years ago

Cool idea, generating audio environments is definitely worth investigating!

Stereo (or multi-channel) audio is related to general multivariate modelling (see also https://github.com/ibab/tensorflow-wavenet/issues/308). Implementing multichannel support will require code changes in various places, but it's not too complicated in principle (if you ignore cross-channel correlations 🤔).

Two audio channels are quite feasible (with, e.g., 8 bits per channel), but as more channels are added to a multivariate model, the output layer grows and memory consumption becomes an issue. With separate per-channel softmaxes the growth is linear (C × 256 logits for C 8-bit channels), whereas modelling the joint distribution over channels directly would grow exponentially (256^C classes, already 65,536 for stereo).

One way to implement this could be to one-hot encode each audio channel and use a separate softmax layer for each channel in the model. The required changes would roughly be: one-hot input encoding per channel (concatenated along the feature axis), one softmax output head per channel, a summed per-channel cross-entropy loss, and sampling every channel at each generation step. A minimal sketch of this is shown below.
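Something like this, in TF 1.x style like the rest of the repo; `hidden`, `NUM_CHANNELS`, and `QUANT_LEVELS` are illustrative names of my own, not identifiers from the codebase:

```python
import tensorflow as tf

NUM_CHANNELS = 2     # stereo (assumed)
QUANT_LEVELS = 256   # 8-bit mu-law per channel, as in the mono model

def multichannel_input(samples):
    """samples: [batch, time, channels] integer amplitudes.
    One-hot encode each channel, concatenate along the feature axis."""
    one_hot = tf.one_hot(samples, QUANT_LEVELS)   # [b, t, c, q]
    shape = tf.shape(samples)
    return tf.reshape(one_hot,
                      [shape[0], shape[1], NUM_CHANNELS * QUANT_LEVELS])

def multichannel_logits(hidden):
    """hidden: [batch, time, dim] output of the shared dilated conv stack.
    One 1x1 projection (dense over time) and softmax head per channel."""
    per_channel = [
        tf.layers.dense(hidden, QUANT_LEVELS, name='logits_ch%d' % ch)
        for ch in range(NUM_CHANNELS)]
    return tf.stack(per_channel, axis=2)          # [b, t, c, q]

def multichannel_loss(logits, targets):
    """targets: [batch, time, channels] integer class labels.
    Per-channel cross-entropies, averaged over batch, time, and channels."""
    xent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    return tf.reduce_mean(xent)
```

Note that this ignores cross-channel correlations: each channel's softmax sees only the shared hidden state, not the other channel's current sample.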

Multichannel image modelling (with R/G/B colour channels) has already been done in PixelCNN, which is otherwise very similar to WaveNet. In PixelCNN++ (https://github.com/openai/pixel-cnn/tree/master/pixel_cnn_pp), a logistic mixture density network approach was proposed to deal with the issues of quantisation and one-hot amplitude encoding. They also used a weight-tying scheme to account for cross-channel dependencies (!)

Ultimately, I think the mixture density approach is the way to go, since assuming a continuous latent variable for the amplitudes makes a lot more sense than treating them as discrete classes.
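For concreteness, here is a rough sketch of a discretized mixture-of-logistics log-likelihood for one audio channel, in the spirit of PixelCNN++; all names and the 16-bit bin count are illustrative assumptions, not code from either repo:

```python
import tensorflow as tf

def mixture_log_prob(x, logit_pi, mu, log_s, num_bins=65536):
    """x: [batch, time] amplitudes scaled to [-1, 1].
    logit_pi, mu, log_s: [batch, time, K] mixture parameters predicted
    by the network (weights, means, log-scales of K logistics)."""
    x = tf.expand_dims(x, -1)                     # broadcast over K
    half_bin = 1.0 / (num_bins - 1)
    inv_s = tf.exp(-log_s)
    # Probability mass of the quantisation bin around x under each logistic
    # (the edge bins at +/-1 are omitted for brevity).
    cdf_plus = tf.sigmoid(inv_s * (x + half_bin - mu))
    cdf_minus = tf.sigmoid(inv_s * (x - half_bin - mu))
    log_bin_prob = tf.log(tf.maximum(cdf_plus - cdf_minus, 1e-12))
    # Mix the components in the log domain.
    return tf.reduce_logsumexp(
        tf.nn.log_softmax(logit_pi) + log_bin_prob, axis=-1)

# Training loss would be the mean negative log-likelihood:
# loss = -tf.reduce_mean(mixture_log_prob(x, logit_pi, mu, log_s))
```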

pullahs commented 6 years ago

@ljuvela What do you think about conditioning WaveNet on the length of the training data (the IR files)? That way, could we specify a length for the generation process so that it builds an IR of the requested duration? For example, when an IR is around 6 seconds long, it is mostly a hall-type reverb, etc.
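One way this might work, purely as a sketch: bucket the IR durations into coarse classes and feed the bucket id through a global-conditioning embedding, analogous to how this repo handles speaker ids. The bucket edges and embedding size below are made-up values:

```python
import numpy as np
import tensorflow as tf

LENGTH_BUCKETS = np.array([1.0, 2.0, 4.0, 8.0])  # seconds; made-up edges
GC_CHANNELS = 32                                  # embedding size (assumed)

def length_bucket(duration_sec):
    """Map an IR duration in seconds to an integer bucket id."""
    return int(np.digitize(duration_sec, LENGTH_BUCKETS))

bucket_id = tf.placeholder(tf.int32, [])          # set once per training file
embedding_table = tf.get_variable(
    'length_embedding', [len(LENGTH_BUCKETS) + 1, GC_CHANNELS])
gc_embedding = tf.nn.embedding_lookup(embedding_table, bucket_id)
# gc_embedding would then be broadcast into every dilation layer, the same
# way the repo injects its global conditioning vector.
```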

zegenerative commented 5 years ago

I am reading this and I am super intrigued by it. I was wondering whether you continued working on this and made any progress on the stereo implementation? Also, the links don't work anymore. I hope to hear either from you or from someone who has picked this up.

ljuvela commented 5 years ago

Thanks for bumping this; I had forgotten about the thread. 😕 I'm also interested to hear whether anyone has done further experiments. (I've been busy with other things, like https://arxiv.org/abs/1811.00334 and https://arxiv.org/abs/1804.09593.)

There's another WaveNet repo that provides mixture density training plus local and global conditioning: https://github.com/r9y9/wavenet_vocoder

As for conditioning room impulse response generation, one approach could be global conditioning on room metadata, or maybe on T60 (https://en.wikipedia.org/wiki/Reverberation#Reverberation_time). Another thing worth trying, I think, would be local conditioning on the IR envelope for more precise control.
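To make the T60 idea concrete, here is a rough NumPy sketch of estimating T60 from an impulse response via Schroeder backward integration, e.g. to build conditioning labels; the decibel range used for the fit is a common choice on my part, not something prescribed here:

```python
import numpy as np

def estimate_t60(ir, fs):
    """Estimate T60 (seconds) from an impulse response array.
    Assumes the IR actually decays past -35 dB."""
    energy = ir.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]        # Schroeder energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

    # Fit the -5 dB .. -35 dB portion of the decay, extrapolate to -60 dB.
    i5 = np.argmax(edc_db <= -5.0)
    i35 = np.argmax(edc_db <= -35.0)
    t = np.arange(len(ir)) / float(fs)
    slope, _ = np.polyfit(t[i5:i35], edc_db[i5:i35], 1)
    return -60.0 / slope
```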