Firstly, if your speech audio files are excessively long, we recommend removing long gaps and splitting them into smaller segments. No other pre-processing is required: the program automatically samples a random section from each longer audio file.
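The random-section sampling described above can be pictured roughly as follows. This is a minimal sketch, not the repository's data loader: the function name `random_segment` is hypothetical, and the segment length of 32512 samples is borrowed from the benchmark at the end of this README purely as an example value.

```python
import random

def random_segment(samples, segment_length, rng=random):
    """Return a random contiguous slice of `samples` with `segment_length` items."""
    if len(samples) <= segment_length:
        # Clips that are already short enough are returned whole;
        # any padding is left to the caller (an assumption of this sketch).
        return samples
    start = rng.randrange(len(samples) - segment_length + 1)
    return samples[start:start + segment_length]

# Stand-in "audio": 100000 sample indices instead of real waveform values.
clip = random_segment(list(range(100000)), 32512, random.Random(0))
```

Because a fresh offset is drawn on every call, a long recording contributes a different section each time it is visited.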
It also automatically resamples the high-sample-rate audio down to the low sample rate and then upsamples it back to the target sample rate. Downsampling simulates the loss of high-frequency content, and upsampling again aligns the low-res audio with the original high-sample-rate audio, so you don't need to resample the original audio manually.
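To illustrate what this downsample-then-upsample step does to a signal, here is a minimal NumPy sketch. It assumes ideal band-limited resampling, which is equivalent to zeroing every frequency bin above the low-res Nyquist frequency; the resampler actually used by the program is not shown here and may differ. The name `simulate_low_res` and the 16 kHz low-res rate are hypothetical.

```python
import numpy as np

def simulate_low_res(hr, hr_rate, lr_rate):
    """Keep only content up to the low-res Nyquist frequency, at the original length."""
    n = len(hr)
    spec = np.fft.rfft(hr)
    # Bin k corresponds to k * hr_rate / n Hz; keep bins up to lr_rate / 2.
    keep = int(n * lr_rate / (2 * hr_rate)) + 1
    spec[keep:] = 0.0
    return np.fft.irfft(spec, n=n)

hr_rate, lr_rate = 48000, 16000          # lr_rate is an example value
t = np.arange(4800) / hr_rate            # 0.1 s of audio
tone_1k = np.sin(2 * np.pi * 1000 * t)   # below the 8 kHz low-res Nyquist
tone_10k = np.sin(2 * np.pi * 10000 * t) # above it, so it is lost
high_res = tone_1k + tone_10k
low_res = simulate_low_res(high_res, hr_rate, lr_rate)
```

The output has the same number of samples as the input, which is why the low-res and high-res signals stay aligned sample by sample.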
Secondly, prepare your dataset index file like this (VCTK dataset example):
```
wav48/p250/p250_328.wav
wav48/p310/p310_345.wav
wav48/p227/p227_020.wav
wav48/p285/p285_050.wav
wav48/p248/p248_011.wav
wav48/p246/p246_030.wav
wav48/p247/p247_191.wav
wav48/p287/p287_127.wav
wav48/p334/p334_220.wav
wav48/p340/p340_414.wav
wav48/p236/p236_231.wav
wav48/p301/p301_334.wav
...
```
Save it as a text file in the root directory of your dataset. The program joins the index file's parent folder with each relative path listed in the file. The index file used in our experiments is available at `data/train.csv`.
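The path handling described above can be sketched with the standard library. Both helper names (`write_index`, `resolve_index`) and the `**/*.wav` pattern are hypothetical; the sketch only mirrors the stated rule that each record is joined with the parent folder of the index file.

```python
from pathlib import Path

def write_index(dataset_root, index_name="train.csv", pattern="**/*.wav"):
    """Write one dataset-relative path per line into the dataset root."""
    root = Path(dataset_root)
    rel_paths = sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
    (root / index_name).write_text("\n".join(rel_paths) + "\n", encoding="utf-8")
    return rel_paths

def resolve_index(index_file):
    """Join the index file's parent folder with every relative path it lists."""
    root = Path(index_file).resolve().parent
    paths = []
    with open(index_file, encoding="utf-8") as f:
        for line in f:
            rel = line.strip()
            if rel:  # skip blank lines
                paths.append(root / rel)
    return paths
```

Keeping the entries relative means the dataset folder can be moved or mounted elsewhere without regenerating the index.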
Modify and run `sh train.sh`. Detailed explanations of the arguments can be found in `options/base_options.py` and `options/train_options.py`.
Parameter Name | Description |
---|---|
--name | Name of the experiment. It decides where to store samples and models. |
--dataroot | Path to your train set csv file. |
--evalroot | Path to your eval set csv file. |
--lr_sampling_rate | Input low-res sampling rate. The audio is automatically resampled to this value. |
--sr_sampling_rate | Target super-resolution sampling rate. |
--fp16 | Train with Automatic Mixed Precision (AMP). |
--nThreads | Number of threads for loading data. |
--lr | Initial learning rate for the Adam optimizer. |
--arcsinh_transform | Use $\log(x+\sqrt{x^2+1})$ to compress the range of input. |
--abs_spectro | Use the absolute value of the spectrogram. |
--arcsinh_gain | Gain parameter for the arcsinh_transform. |
--center | Centered MDCT. |
--norm_range | Specify the target distribution range. |
--abs_norm | Assume the spectrograms are all distributed in a fixed range. Normalize by an absolute range. |
--src_range | Specify the source distribution range. Used when --abs_norm is specified. |
--netG | Select the model to use for netG. |
--ngf | Number of generator filters in the first conv layer. |
--n_downsample_global | Number of downsampling layers in netG. |
--n_blocks_global | Number of residual blocks in the global generator network. |
--n_blocks_attn_g | Number of attention blocks in the global generator network. |
--dim_head_g | Dimension of attention heads in the global generator network. |
--heads_g | Number of attention heads in the global generator network. |
--proj_factor_g | Projection factor of attention blocks in the global generator network. |
--n_blocks_local | Number of residual blocks in the local enhancer network. |
--n_blocks_attn_l | Number of attention blocks in the local enhancer network. |
--fit_residual | If specified, fit $HR-LR$ rather than fitting $HR$ directly. |
--upsample_type | Select upsampling layers for netG. Supported options: interpolate, transconv. |
--downsample_type | Select downsampling layers for netG. Supported options: resconv, conv. |
--num_D | Number of discriminators to use. |
--eval_freq | Frequency of evaluating metrics. |
--save_latest_freq | Frequency of saving the latest results. |
--save_epoch_freq | Frequency of saving checkpoints at the end of epochs. |
--display_freq | Frequency of showing training results on screen. |
--tf_log | If specified, use TensorBoard logging. Requires TensorFlow to be installed. |
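As a rough illustration of two of the options above, the sketch below shows the arcsinh compression from `--arcsinh_transform` and a fixed-range rescaling in the spirit of `--abs_norm`, `--src_range` and `--norm_range`. The function names are hypothetical, and two details are assumptions rather than facts from the code: that the gain multiplies the input before the transform, and that the normalization is a plain linear map. `options/base_options.py` remains the authority on the actual behaviour.

```python
import math

def arcsinh_compress(x, gain=1.0):
    # ASSUMPTION: the gain scales the input before the transform.
    # asinh(x) is identical to log(x + sqrt(x^2 + 1)).
    return math.asinh(gain * x)

def arcsinh_expand(y, gain=1.0):
    # Inverse of arcsinh_compress under the same assumption.
    return math.sinh(y) / gain

def rescale(x, src_range, norm_range):
    # ASSUMPTION: a linear map from the source range onto the target range.
    src_lo, src_hi = src_range
    dst_lo, dst_hi = norm_range
    return dst_lo + (x - src_lo) * (dst_hi - dst_lo) / (src_hi - src_lo)
```

The arcsinh curve is close to linear near zero and logarithmic for large magnitudes, and unlike a plain logarithm it is defined for negative values, which matters because MDCT coefficients are signed.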
Modify and run `sh gen_audio.sh`.
This repository draws heavily on the official pix2pixHD implementation. This work is also based on an improved version of my undergraduate Final Year Project; see pix2pixHDAudioSR.
Try `FastMDCT4`/`FastIMDCT4` in `models/mdct.py` for faster MDCT conversion. You can use `FastMDCT4` as a drop-in replacement for `MDCT4`, or change the import statement in `models/pix2pixHD_model.py` to `from .mdct import FastMDCT4 as MDCT4, FastIMDCT4 as IMDCT4`.
On my computer (RTX 3070 laptop, Intel Core i7 11800H), each forward transform is about 2 ms faster, roughly a 20% reduction:
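```python
import torch

# Assumes `mdct` and `fast_mdct` were created beforehand from MDCT4 and FastMDCT4 (setup not shown).
sig = torch.randn(64, 32512, device='cuda')
%timeit -r 20 -n 500 mdct(sig)
# 9.61 ms ± 643 µs per loop (mean ± std. dev. of 20 runs, 500 loops each)
%timeit -r 20 -n 500 fast_mdct(sig)
# 7.68 ms ± 691 µs per loop (mean ± std. dev. of 20 runs, 500 loops each)
```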
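Outside IPython, where `%timeit` is unavailable, a similar measurement can be sketched with the standard-library `timeit` module. The helper name `time_ms` and its `sync` parameter are hypothetical; when timing CUDA code you would likely pass `sync=torch.cuda.synchronize` so that queued GPU work is included in the measurement.

```python
import timeit

def time_ms(fn, *args, repeat=20, number=500, sync=None):
    """Return the mean time per call in milliseconds for each of `repeat` runs."""
    def run():
        fn(*args)
        if sync is not None:
            sync()  # wait for asynchronous work (e.g. GPU kernels) to finish
    totals = timeit.repeat(run, repeat=repeat, number=number)
    return [1000.0 * total / number for total in totals]

# CPU-only stand-in workload; replace with e.g. time_ms(fast_mdct, sig, sync=...).
per_call_ms = time_ms(sum, [1, 2, 3], repeat=3, number=10)
```

Comparing the minimum or mean of the returned list for the two transforms gives a like-for-like figure on your own hardware.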