santi-pdp / segan

Speech Enhancement Generative Adversarial Network in TensorFlow

Negative results #3

Closed keunwoochoi closed 7 years ago

keunwoochoi commented 7 years ago

Hi, nice paper, and thanks for sharing the code!

I'd like to ask if you can share some negative results from designing/selecting the hyperparameters. For example, why PReLU? What were the results like when you used LReLU or ReLU? How was the filter width of 31 decided? I think this would be hugely valuable, especially because training GANs is quite tricky and can be task/data specific.

santi-pdp commented 7 years ago

Hi there, and thanks!

Regarding instability (divergence and D_loss not going to zero): we pretty much solved it with LSGAN + RMSProp, virtual batch norm and low learning rates, after extensive experimentation with GANs on simplified domains.

Regarding the non-linear activations: at first we used LeakyReLU in both G and D, but we heard some high-frequency noises and speech distortions (and SEGANv1 still leaks some high-frequency content from the input noise at times). We thought it might be better to have something in between a ReLU (quick and stable in existing GANs) and a LeakyReLU that could be learned, i.e. PReLU, in G, since it is critical for G to have enough capacity to generate well-cleaned samples. We left D as it was, because its cost always drops quite low, so it looks powerful enough. An interesting thing to inspect is the distribution of the alpha vectors (PReLU coefficients) in every G layer: the first layer learns high leaky coefficients (closer to the original LeakyReLU behaviour), while the subsequent layers settle on smaller factors closer to a plain ReLU (which has turned out to be stable and quick to learn in other works). For example, the next figures show different G-encoder layers over time (TensorBoard):

[Figures: PReLU alpha coefficient distributions over training for the first four and the intermediate four G-encoder layers (alpha_first_4_layers, alpha_interm_4_layers)]
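To make the above a bit more concrete in code, here is a rough sketch (not our exact implementation) of the two pieces mentioned: a learnable PReLU activation and LSGAN-style least-squares losses trained with RMSProp, written in the TF 1.x style this repo uses. The function names, the initial alpha value and the learning rate are illustrative assumptions.

```python
import tensorflow as tf

def prelu(x, name='prelu'):
    # Parametric ReLU: a LeakyReLU whose per-channel negative slope alpha
    # is learned. Minimal sketch; the actual SEGAN code may differ in
    # initialization and variable scoping.
    with tf.variable_scope(name):
        alpha = tf.get_variable('alpha', shape=x.get_shape()[-1:],
                                initializer=tf.constant_initializer(0.2))
        return tf.nn.relu(x) + alpha * tf.minimum(x, 0.)

def lsgan_losses(d_real, d_fake):
    # Least-squares GAN objectives instead of sigmoid cross-entropy;
    # d_real / d_fake stand for D's outputs on clean and enhanced audio.
    d_loss = 0.5 * (tf.reduce_mean(tf.squared_difference(d_real, 1.)) +
                    tf.reduce_mean(tf.square(d_fake)))
    g_loss = 0.5 * tf.reduce_mean(tf.squared_difference(d_fake, 1.))
    return d_loss, g_loss

# RMSProp with a small learning rate, as mentioned above; the exact value
# here is illustrative, not necessarily what the repo ships with.
d_opt = tf.train.RMSPropOptimizer(learning_rate=2e-4)
g_opt = tf.train.RMSPropOptimizer(learning_rate=2e-4)
```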

Things made sense along that non-linearity direction. Now, regarding the kernel width: it started at 5, but in the early stages of the project we heard low-frequency noises on some training samples, so we increased it so that upper layers get larger receptive fields over the waveform (more samples --> lower frequencies can be treated, in case the net performs some kind of frequency decomposition), and we also increased the depth of G-enc and G-dec (making the receptive field even larger). I guess this is somewhat aligned with recent research on music processing with convnets, where it is claimed that computer-vision filter designs do not match audio-analysis purposes; in audio we usually work with information related to spectral decomposition, so our intuition was "the larger the window, the more frequency resolution, in case the net needs it".

At the end of the day we could not do an extensive architecture search to test many variations, since we had limited GPUs and each model took roughly 1.5 days to train, but we had clues (ReLU vs LeakyReLU) and some prior expectations and knowledge (filter widths, depth). To conclude, I have to say that we have recent improvements, developed after SEGANv1, that enhance better and get rid of the previous annoying artifacts (preliminary tests point to this). We will release samples and code soon.
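As a rough illustration of the kernel-width point (again a sketch, not our exact code), here is what one G-encoder downsampling block looks like with width-31 1-D convolutions over the raw waveform and a PReLU activation; `g_enc_block`, the initializers and the feature-map sizes are illustrative assumptions.

```python
import tensorflow as tf

def g_enc_block(x, out_channels, kwidth=31, stride=2, name='enc'):
    # One G-encoder downsampling block: a wide 1-D conv over the waveform
    # (kernel width 31 rather than 5, so upper layers see larger receptive
    # fields) followed by PReLU with a learnable negative slope.
    with tf.variable_scope(name):
        in_channels = x.get_shape().as_list()[-1]
        w = tf.get_variable(
            'w', [kwidth, in_channels, out_channels],
            initializer=tf.truncated_normal_initializer(stddev=0.02))
        h = tf.nn.conv1d(x, w, stride=stride, padding='SAME')
        alpha = tf.get_variable('alpha', [out_channels],
                                initializer=tf.constant_initializer(0.2))
        return tf.nn.relu(h) + alpha * tf.minimum(h, 0.)

# Illustrative usage on SEGAN-sized chunks (16384 waveform samples):
wav = tf.placeholder(tf.float32, [None, 16384, 1])
h1 = g_enc_block(wav, 16, name='enc_1')   # -> [None, 8192, 16]
h2 = g_enc_block(h1, 32, name='enc_2')    # -> [None, 4096, 32]
```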

keunwoochoi commented 7 years ago

Oops, I read this on my phone and then forgot to reply. Thanks for your very detailed tips! 👍 👍