w-transposed-x / hifi-gan-denoising

An unofficial PyTorch implementation of "HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks" by Su et al. (2020).
GNU General Public License v3.0

Open questions #1

Open · n-Guard opened this issue 3 years ago

n-Guard commented 3 years ago

The following questions arose when implementing the paper:

1. Regarding Post-Net: In the Tacotron paper the authors cite, the Post-Net is used after the so-called 'Synthesizer', the module that produces a mel-spectrogram from text. In the HiFi-GAN method, the Post-Net is instead used after WaveNet. Did the authors empirically find that placing the Post-Net this way gives better results? Or might the improvement simply be explained by the additional parameters available to learn the mapping from noisy to clean speech?

2. Detailed specification of the GLU blocks in SpecGan: We could not find a detailed specification of the GLU blocks used within the SpecGan discriminator. In some implementations found online, only one initial convolution is performed, but with twice the number of channels, whose output is then split in half to form the two pathways of the GLU block. In other implementations, two separate 1x1 convolutions are applied to the output of the BatchNorm layer to produce the inputs for the two pathways.

3. Data augmentation: We were not able to reliably augment impulse responses using Bryan's (2019) proposed method; the RT60 of the augmented impulse responses deviated drastically from the desired RT60 (the RT60 check we use for this is sketched after this list). The authors cite 'A WaveNet for Speech Denoising' (2018), in which speaker conditioning is used to train WaveNet. After augmenting speaker audio, did they use the speaker conditioning corresponding to the original speaker audio, or did they omit speaker conditioning when augmenting speakers? When changing the speed of speaker audio, did the pitch change as well? The model is essentially trained in three phases (WaveNet; WaveNet + Post-Net; all modules), and the paper distinguishes between data simulation and data augmentation. It is not entirely transparent to us which kind of data augmentation was used in the individual phases. For example, in the first phase, when only WaveNet is trained, what did the data simulation and/or augmentation amount to?
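For reference, this is the kind of minimal RT60 check we run on augmented impulse responses (Schroeder backward integration with a -5 dB to -25 dB fit extrapolated to -60 dB). It is our own sanity-check sketch, not code from the paper or from Bryan (2019); the function name and the fitting range are illustrative choices.

```python
# Minimal RT60 estimation via Schroeder backward integration.
# Sanity-check sketch only; the -5/-25 dB decay range is an illustrative choice.
import numpy as np

def estimate_rt60(rir: np.ndarray, sr: int) -> float:
    # Schroeder energy decay curve: reverse cumulative sum of squared RIR, in dB.
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(energy / energy.max() + 1e-12)

    # Fit a line to the -5 dB .. -25 dB portion and extrapolate to -60 dB.
    i5 = np.argmax(edc_db <= -5.0)
    i25 = np.argmax(edc_db <= -25.0)
    t = np.arange(len(rir)) / sr
    slope, _ = np.polyfit(t[i5:i25], edc_db[i5:i25], 1)
    return -60.0 / slope  # RT60 in seconds
```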

francislata commented 3 years ago

Here are some of the answers I got from the authors of HiFi-GAN:

2. Detailed specification of GLU blocks in SpecGan

The response I got is:

For 2), the two variants are mathematically equivalent; either implementation is fine. In general, one would prefer a single initial convolution with twice the number of channels followed by a split, for some extra efficiency (see the WaveGlow GitHub for details), but as stated in our paper, we used two pathways instead of doing the split above.
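To make the two variants concrete, here is a minimal PyTorch sketch of both parameterizations. The 1x1 kernels and module names are illustrative assumptions, not the exact SpecGan block from the paper.

```python
# Two parameterizations of a GLU gate: one fused conv with 2x channels that is
# split into value/gate pathways, versus two separate convs. Mathematically
# equivalent; only the grouping of parameters differs.
import torch
import torch.nn as nn

class GLUSplit(nn.Module):
    """One convolution with 2x channels, then split into value and gate."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size=1)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

class GLUTwoPath(nn.Module):
    """Two separate 1x1 convolutions producing the value and gate pathways."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.value = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.value(x) * torch.sigmoid(self.gate(x))
```

The fused variant performs a single convolution instead of two, which is the efficiency point the authors mention in connection with WaveGlow.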

3. Data augmentation: With regard to speaker conditioning, section 2.1 indicates that no speaker conditioning is used, so it can be disregarded. One of the authors confirmed this as well.

For the first phase of training (referred to as Base in the paper), they used data simulation: convolving an RIR with clean speech from the DAPS dataset and adding noise on top of the reverberated speech. This yields far more samples than the DAPS dataset provides on its own.
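A minimal sketch of that simulation step might look as follows, assuming a target SNR is drawn per example; the SNR handling and length matching are our assumptions and may differ from the scripts used by the authors.

```python
# Minimal data-simulation sketch: reverberate clean speech with an RIR, then add
# noise at a target SNR. Normalization details are assumptions, not the paper's code.
import numpy as np
from scipy.signal import fftconvolve

def simulate(clean: np.ndarray, rir: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Reverberate and trim back to the clean-speech length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]

    # Loop or crop the noise to match the speech length.
    if len(noise) < len(reverberant):
        noise = np.tile(noise, int(np.ceil(len(reverberant) / len(noise))))
    noise = noise[: len(reverberant)]

    # Scale the noise to the desired SNR relative to the reverberant speech.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + noise
```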

The extra augmentation documented in the HiFi-GAN paper is applied starting at phase two.

Here's an excerpt from an email conversation about the simulated dataset:

There are several parallel speech enhancement datasets out there. You can try the datasets we used in the paper. One is the DAPS dataset (https://archive.org/details/daps_dataset), which provides parallel recordings for real environments. Another is the noisy dataset created from VCTK (https://datashare.is.ed.ac.uk/handle/10283/2791?show=full) and its noisy reverberant version (https://datashare.is.ed.ac.uk/handle/10283/2826). There are also parallel datasets provided by the REVERB Challenge (https://reverb2014.dereverberation.com/data.html) and other speech enhancement challenges.

The sizes of the existing datasets above (especially the number of environments) are limited, and thus we also created our own simulated dataset for training. We use the "clean" speech set from the DAPS dataset (https://archive.org/details/daps_dataset), RIRs from the MIT IR Survey (https://mcdermottlab.mit.edu/Reverb/IR_Survey.html), and noise samples from the REVERB Challenge (https://reverb2014.dereverberation.com/data.html) and the ACE Challenge (http://www.ee.ic.ac.uk/naylor/ACEweb/index.html). Then you can randomly mix speech, RIR and noise (the REVERB Challenge provides a script for mixing them, and you can look into it to get an idea).

For extra data augmentation, we 1) randomly resample speech audio, RIR and noise to make them faster or slower, 2) rescale the energy of the RIR's direct signal (you can check this paper for how to do it: https://arxiv.org/abs/1909.03642), and 3) apply a multi-band filter on the noise and on the final mixed result to obtain various colorations. The data augmentation helps the model to be more robust to new environments, but you should already get reasonable results even without this data augmentation step.
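As a rough illustration of points 1) and 3) from the excerpt, a minimal sketch of random resampling (which changes speed and pitch together) and a crude multi-band gain filter could look like this. The rate range, band edges, and gain range are illustrative assumptions, and the direct-signal rescaling from Bryan (2019) is omitted.

```python
# Sketch of two of the extra augmentations mentioned above: random resampling
# (speed/pitch change) and a random per-band gain for coloration. All numeric
# ranges are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.signal import butter, sosfilt, resample

def random_resample(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # FFT resampling to a randomly chosen length; played back at the original
    # sample rate this changes speed and pitch together.
    rate = rng.uniform(0.9, 1.1)
    return resample(x, int(len(x) * rate))

def random_multiband_gain(x: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    # Split the signal into a few bands and apply a random gain (+/- 6 dB) per band.
    # Band edges assume sr >= 16 kHz.
    edges = [0, 300, 1500, 4000, sr // 2]
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:
            sos = butter(4, hi, btype="lowpass", fs=sr, output="sos")
        elif hi >= sr // 2:
            sos = butter(4, lo, btype="highpass", fs=sr, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        gain = 10 ** (rng.uniform(-6.0, 6.0) / 20.0)
        out += gain * sosfilt(sos, x)
    return out
```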