FFT/IFFT instead of u-law encoding

nakosung commented 7 years ago

https://www.youtube.com/watch?v=NYDeH-knnAI

Is it worth trying FFT/IFFT? Most audio signal processing involves FFT/IFFT, so I think it is natural to work in the frequency domain. What do you think about this approach?

ibab commented 7 years ago

I don't think the network will be able to do a good job at predicting the next FFT sample from all previous ones, but it might be worth trying. I think that WaveNet is quite specific to sequence prediction, considering the way we train/generate and the causality of the filter.

Edit: Thanks for pointing out that a spectrogram was meant. This makes more sense.

lemonzi commented 7 years ago

I think he means a spectrogram. Notice that in that case we would be doing multivariate regression, not classification, so the loss function would have to be adjusted.
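To make the classification-versus-regression distinction concrete, here is a minimal NumPy sketch. It assumes 256-level mu-law quantization for the classification targets; the sample rate, frame length, and the helper names `mu_law_encode` and `magnitude_frame` are illustrative and not taken from the repository.

```python
import numpy as np

def mu_law_encode(audio, channels=256):
    """Map floats in [-1, 1] to integer class indices in [0, channels - 1]."""
    mu = channels - 1
    audio = np.clip(audio, -1.0, 1.0)
    # Compress the amplitude logarithmically, then quantize uniformly.
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def magnitude_frame(audio, n_fft=512):
    """Magnitude spectrum of one Hann-windowed frame: a real-valued vector."""
    frame = audio[:n_fft] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frame))

# Hypothetical test signal: a 440 Hz tone sampled at 16 kHz.
t = np.arange(512) / 16000.0
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)

classes = mu_law_encode(audio)   # one integer label per sample (classification)
frame = magnitude_frame(audio)   # n_fft // 2 + 1 real values per frame (regression)
print(classes.shape, classes.dtype, frame.shape, frame.dtype)
```

With integer labels a softmax and cross-entropy loss fits naturally; with a real-valued vector per frame, something like a squared-error loss would be needed instead.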

The whole point of this network, though, is that it can extract a meaningful representation from raw audio. We usually use FFTs and spectrograms because they are the best we know, but they are destructive: we discard the phase, and the windowing and the time-frequency duality introduce a lot of artifacts.
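A small sketch of why discarding the phase is destructive: two different waveforms can share exactly the same magnitude spectrum. The signal (a single click) and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# A click: all energy concentrated at a single sample.
click = np.zeros(n)
click[n // 2] = 1.0

magnitude = np.abs(np.fft.rfft(click))

# Keep the magnitude but replace the phase with random values.
random_phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
random_phase[0] = 0.0    # the DC bin must be real for a real signal
random_phase[-1] = 0.0   # so must the Nyquist bin (n is even)
scrambled = np.fft.irfft(magnitude * np.exp(1j * random_phase), n=n)

same_magnitude = np.allclose(np.abs(np.fft.rfft(scrambled)), magnitude)
same_waveform = np.allclose(scrambled, click)
print(same_magnitude, same_waveform)
```

The scrambled signal has the click's magnitude spectrum but its energy is smeared across the whole frame, which is the information a magnitude-only representation cannot recover by itself.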

It's worth a shot if you feel like playing with the model, though!

nakosung commented 7 years ago

I agree about the artifacts the IFFT would introduce. If we instead feed FFT'd frames (a spectrogram) as an additional input, I think the network could capture meaningful information that cannot be detected within the receptive field. (It could also be done with a receptive field of matching size.)

By the way (kind of off-topic): since WaveNet has proven its power at signal processing and reconstruction, could we apply the same technique to motion synthesis, as described in the video above?

lemonzi commented 7 years ago

I think the receptive fields we are using now are already as large as an FFT frame, and if not they should be.
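As a rough check of that claim, the receptive field of a stack of dilated causal convolutions can be computed directly. This is a sketch under assumptions: the dilation schedule, filter width, and FFT frame length below are hypothetical, and the formula ignores any extra initial causal layer a particular implementation may add.

```python
def receptive_field(filter_width, dilations):
    """Input samples seen by a stack of dilated causal convolutions.

    Each layer with dilation d and filter width w extends the field
    by (w - 1) * d samples.
    """
    return (filter_width - 1) * sum(dilations) + 1

# Hypothetical configuration: 5 stacks of dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 5
field = receptive_field(filter_width=2, dilations=dilations)

fft_frame = 1024  # hypothetical FFT frame length in samples
print(field, field >= fft_frame)
```

Under these assumed settings the field covers several FFT frames' worth of samples; a shallower stack would need to be checked the same way.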

Isn't that new paper on motion synthesis using a similar concept? It wouldn't be a WaveNet anymore (I would restrict "WaveNet" to the original paper, about auto-regressing a 1D time-series using a cascade of dilated convolutions, skip connections, etc., and maybe with a one-hot input). But I agree the concept of using dilated convolutions for time-series modelling rather than the classic recurrent units / LSTM is very promising.

nakosung commented 7 years ago

Closing this issue.

Cortexelus commented 7 years ago

> but they are destructive because we discard the phase

List of Papers on Phase Recovery: https://www.evernote.com/shard/s260/sh/72efd25c-491c-4a8a-a8db-aa2d6959ee92/1d6b05ae86f948d3

lemonzi commented 7 years ago

These are all approximations that enforce different sets of constraints.

nakosung commented 7 years ago

@Cortexelus @lemonzi Would keeping both the real and imaginary parts preserve the phase information? If so, we could feed the real and imaginary parts of the FFT output into WaveNet.

Cortexelus commented 7 years ago

Real + imaginary preserve phase, yes. Think polar geometry: if your complex number is 12 + 5i, the phase is the angle θ and the magnitude is the absolute value r.
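The same example worked through in NumPy, as a minimal sketch showing that the rectangular and polar forms carry the same information:

```python
import numpy as np

z = 12 + 5j

r = np.abs(z)        # magnitude: sqrt(12**2 + 5**2) = 13
theta = np.angle(z)  # phase in radians: atan2(5, 12)

# Polar back to rectangular: nothing is lost in either direction.
z_back = r * np.exp(1j * theta)
print(r, theta, np.isclose(z_back, z))
```

The same two calls apply elementwise to a whole FFT frame, so a frame of (real, imag) pairs can always be converted to (magnitude, phase) pairs and back.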

Likely better results with (magnitude, phase) than (real, imag), because magnitudes are more strongly correlated with each other, both vertically (same frame, different bin) and horizontally (same bin, different frame).

You could also try (magnitude, delta phase) or pairs of (instantaneous frequency, magnitude). This may help you more easily exploit correlations among frequencies in steady (harmonic) signals. But it may have trouble with transients (percussion, onsets), where absolute phase matters. The delta phase (difference in phase between frames) can be used to calculate instantaneous frequency. A great explanation of this is "Pitch shifting using the FT".
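A sketch of how delta phase turns into instantaneous frequency, in the style of a phase vocoder. The function name, window, frame length, hop size, and sample rate are all assumptions chosen for illustration; the estimate is only unambiguous while the true frequency stays within `sample_rate / (2 * hop)` of the bin centre.

```python
import numpy as np

def instantaneous_frequency(frame_prev, frame_curr, hop, sample_rate):
    """Per-bin instantaneous frequency (Hz) from two consecutive frames."""
    n_fft = len(frame_prev)
    window = np.hanning(n_fft)
    phase_prev = np.angle(np.fft.rfft(frame_prev * window))
    phase_curr = np.angle(np.fft.rfft(frame_curr * window))

    bins = np.arange(n_fft // 2 + 1)
    # Phase advance a sinusoid exactly at each bin centre would show over one hop.
    expected = 2 * np.pi * bins * hop / n_fft
    # Delta phase minus the expected advance, wrapped to [-pi, pi).
    deviation = phase_curr - phase_prev - expected
    deviation = (deviation + np.pi) % (2 * np.pi) - np.pi
    return (bins / n_fft + deviation / (2 * np.pi * hop)) * sample_rate

sample_rate, n_fft, hop = 16000, 1024, 256
t = np.arange(n_fft + hop) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)  # steady tone lying between bin centres

freqs = instantaneous_frequency(signal[:n_fft], signal[hop:hop + n_fft],
                                hop, sample_rate)
peak_bin = int(round(440.0 * n_fft / sample_rate))  # bin nearest to 440 Hz
print(peak_bin * sample_rate / n_fft, freqs[peak_bin])
```

The first printed value is the bin centre and the second is the refined estimate, which should land much closer to the tone's actual frequency for a steady signal like this one.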

Also, not every spectrogram has a true waveform that corresponds to it. If you generate an untrue (inconsistent) spectrogram, the iFFT may give you something close, with artifacts. Sometimes a phase-recovery method corrects it. The simplest phase-recovery method (Griffin-Lim) iterates iFFT > FFT > iFFT > FFT > iFFT while resetting the magnitude to the given one at every iteration.
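The iteration can be sketched in plain NumPy as below. This is an illustrative implementation under assumptions, not code from the repository: the `stft`, `istft`, and `griffin_lim` helpers, the Hann window, the frame and hop sizes, and the test tone are all made up for the example.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Overlap-add inverse STFT with squared-window normalization."""
    window = np.hanning(n_fft)
    length = n_fft + hop * (spec.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Samples near the edges are covered by few frames and are poorly conditioned.
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=magnitude.shape)
    spec = magnitude * np.exp(1j * phase)
    for _ in range(n_iter):
        signal = istft(spec, n_fft, hop)           # iFFT
        reestimate = stft(signal, n_fft, hop)      # FFT
        # Keep the new phase, but reset the magnitude to the target.
        spec = magnitude * np.exp(1j * np.angle(reestimate))
    return istft(spec, n_fft, hop)

# Hypothetical target: a tapered 440 Hz tone at 8 kHz.
t = np.arange(4096) / 8000.0
target = np.sin(2 * np.pi * 440.0 * t) * np.hanning(len(t))
magnitude = np.abs(stft(target))

estimate = griffin_lim(magnitude, n_iter=50)
error = (np.linalg.norm(np.abs(stft(estimate)) - magnitude)
         / np.linalg.norm(magnitude))
print(error)
```

The printed value is the relative mismatch between the estimate's STFT magnitude and the target; it measures spectrogram consistency, not how close the recovered waveform is to the original.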

ibab / tensorflow-wavenet

FFT/IFFT instead of u-law encoding #119