ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper
MIT License

WaveNet for the separation of audio sources #163

Open kaihuchen opened 7 years ago

kaihuchen commented 7 years ago

The WaveNet paper mentioned very briefly (without any details) the possibility of using it for audio source separation. Examples of this include: extract the vocal part out of a piece of music, extract the voice of a specific person even if there are many people talking over each other, etc.

This seems feasible in principle, since the generative nature of WaveNet allows it to learn the characteristics of a specific audio source without over-generalizing, which makes it easier to locate and extract the target audio even when it is partially occluded by other sources. However, I don't think this can be done in the time domain (i.e., treating the audio as a sequence of numbers along the time axis); it has to be done in the frequency domain (i.e., treating the audio as a spectral sequence, a sequence of vectors such as you get from a Fourier transform). Why so? Because multiple time-domain signals can partially cancel each other out, and the shape of the waveform is subject to the vagaries of phase shifts. In the frequency domain, on the other hand, overlaying multiple sources (ignoring the phase component) is simply additive, which makes things much simpler.
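Here is a minimal numpy sketch of the cancellation point (purely illustrative, not tied to this repo): two copies of the same sinusoid shifted by half a cycle cancel almost completely when mixed in the time domain, while their magnitude spectra are identical and unaffected by the shift.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
a = np.sin(2 * np.pi * 440 * t)           # a 440 Hz source
b = np.sin(2 * np.pi * 440 * t + np.pi)   # the same source, shifted by half a cycle

mix = a + b
print(np.max(np.abs(mix)))                # ~0: the waveforms cancel in the time domain

# The magnitude spectra ignore the phase shift entirely:
print(np.allclose(np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))))  # True
```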

To summarize, I think WaveNet can be great for doing audio source separation, but this will require extending the current WaveNet implementation a little to handle vector sequences (i.e., the spectral sequence of audios).

Your thoughts?

lemonzi commented 7 years ago

The idea of inputting magnitude-only spectral frames rather than raw audio has been suggested a few times already here in the issues. I'm not a big fan of it because the point of WaveNet is that it works with raw audio, which lets it use all the phase information that is usually discarded and avoids the artifacts introduced by windowing.

However, it's totally feasible and surely worth a shot -- the network already accepts vector sequences as input; right now those are one-hot encoded samples, but they could be anything else. You could even keep the same loss at the output, similar to how scalar_input works. The only addition would then be the calculation of the spectrogram (or whichever time-frequency representation we choose). Keeping the same loss, though, probably wouldn't work because of the phase shifts introduced by destroying and then reconstructing the phase.
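As a rough sketch of what that preprocessing step could look like (hypothetical code, using librosa only for illustration; the function and file names below are made up, not part of this repo):

```python
import numpy as np
import librosa

def magnitude_frames(path, n_fft=512, hop_length=256):
    """Hypothetical front end: raw audio -> magnitude-spectrogram frames."""
    audio, _ = librosa.load(path, sr=16000, mono=True)
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft).T  # shape [num_frames, n_fft // 2 + 1]; phase is discarded

# Each row is a vector-valued "sample"; the dilated stack would see a sequence
# of these vectors instead of one-hot encoded scalar samples.
frames = magnitude_frames('mixture.wav')  # 'mixture.wav' is a placeholder
```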

kaihuchen commented 7 years ago

@lemonzi To clarify, I also think the fact that WaveNet works with raw audio is amazing; it was actually what attracted me to it in the first place, and I would not want to switch to training on spectral data unless there were clear benefits. What I stated above was a strawman suggestion that audio source separation appears to be such a case, and I look forward to someone in the community shooting it down.

Since phase information is removed during training, the loss should be computed on spectral magnitude only. This means that even if the system is trained to zero loss, the reconstructed audio will likely look different in the time domain because the phase differs. Luckily, since human ears are largely insensitive to phase, the two should sound essentially identical.
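A magnitude-only objective could be as simple as the sketch below (hypothetical; `predicted` and `target` are assumed to be `[batch, frames, bins]` magnitude spectrograms, so phase never enters the loss):

```python
import tensorflow as tf

def magnitude_loss(predicted, target):
    # Mean absolute error between magnitude spectrograms; phase plays no role.
    return tf.reduce_mean(tf.abs(predicted - target))
```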