# Tensorflow implementation of "Speaker-Independent Speech Separation with Deep Attractor Network"

Link to original paper

**STILL WORK IN PROGRESS, EXPECT BUGS**
## Requirements

- numpy / scipy
- tensorflow >= 1.2
- matplotlib (optional, for visualization)
- h5py / fuel (optional, for certain datasets)
## Datasets

Currently, the TIMIT and WSJ0 datasets are implemented. You can also use the "toy" dataset for debugging; it is just white noise.

- Follow `app/datasets/TIMIT/readme` for TIMIT dataset preparation.
- Follow `app/datasets/WSJ0/readme` for WSJ0 dataset preparation.
## Configuration

After setting up a dataset, you may want to change `DATASET_TYPE` in the hyperparameters. This is also how you change batch size, learning rate, dataset type, etc.

There's a `default.json` file at the root directory. You can make your own copy and override some of the values. For example, create a JSON file with:

```json
{
    "DATASET_TYPE": "timit",
    "LR": 1e-2,
    "BATCH_SIZE": 8
}
```

Save it as `my_setup.json`; now you can run the script with:

```bash
python main.py -c my_setup.json
```
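As a rough mental model (this is not the repo's actual loading code, just a sketch of the assumed behavior), keys present in your config file shadow those in `default.json`, and everything else keeps its default value:

```python
import json

# Illustrative sketch only: merge a user config over the defaults.
# Keys present in my_setup.json override default.json, key by key.
with open('default.json') as f:
    hparams = json.load(f)
with open('my_setup.json') as f:
    hparams.update(json.load(f))

print(hparams['DATASET_TYPE'], hparams['LR'], hparams['BATCH_SIZE'])
```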
Some commonly used hyperparameters can be overridden by CLI args.
For example, to set the learning rate:

```bash
python main.py -lr=1e-2
```

Here's an incomplete list of them:

```
# set learning rate, overrides LR
-lr
--learn-rate

# set dataset to use, overrides DATASET_TYPE
-ds
--dataset

# set batch size, overrides BATCH_SIZE
-bs
--batch-size
```
**Note:** If you get an out-of-memory (OOM) error from tensorflow, you can try a lower `BATCH_SIZE`.

**Note:** If you change `FFT_SIZE`, `FFT_STRIDE`, `FFT_WND`, or `SMP_RATE`, you should redo dataset preprocessing.

**Note:** If you change the model architecture, previously saved model parameters may not be compatible.
## Examples

Run these under the root directory of this repo:

```bash
# train on the TIMIT dataset
python main.py -ds='timit'

# train using a custom config file
python main.py -c my_setup.json

# train for 100 epochs, saving parameters to params.ckpt
python main.py -ne=100 -o='params.ckpt'

# load saved parameters, train for another 100 epochs, and save again
python main.py -ne=100 -i='params.ckpt' -o='params.ckpt'

# evaluate a trained model on the test set
python main.py -i='params.ckpt' -m=test
```
```bash
# run a demo: separate a mixture and write the results as WAV files
$ python main.py -i='params.ckpt' -m=demo
$ ls *.wav
demo.wav demo_separated_1.wav demo_separated_2.wav

# run the demo on a given input file
$ python main.py -i='params.ckpt' -m=demo -if=file.wav
$ ls *.wav
file.wav file_separated_1.wav file_separated_2.wav
```
To monitor training with TensorBoard:

```bash
tensorboard --logdir=./logs/
```

For a full list of CLI options:

```bash
python main.py --help
```
## Using a custom dataset

1. Make a file `app/datasets/my_dataset.py`.
2. Make a subclass of `app.datasets.dataset.Dataset`:

   ```python
   @hparams.register_dataset('my_dataset')
   class MyDataset(Dataset):
       ...
   ```

   You can use `app/datasets/timit.py` as a reference, or see the sketch after this list.
3. In `app/datasets/__init__.py`, add: `import app.datasets.my_dataset`
4. Set `DATASET_TYPE` to `"my_dataset"` in your JSON config file.
## Tweaking the model

You can make a subclass of `Estimator`, `Encoder`, or `Separator` to tweak the model:

- `Encoder` produces an embedding from log-magnitude spectra.
- `Estimator` estimates attractor points from the embedding.
- `Separator` uses the mixture spectra, mixture embedding, and attractors to get separated spectra (sketched below).
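For intuition, here is the separation step as described in the original paper, in plain numpy. This is an illustrative sketch, not this repo's `Separator` code, which may differ:

```python
import numpy as np

def separate(V, A, mix_mag):
    """Sketch of DANet-style separation (illustrative, not this repo's code).

    V: (TF, D) embedding vectors, one per time-frequency bin.
    A: (C, D) attractor points, one per source.
    mix_mag: (TF,) mixture magnitude spectrogram, flattened.
    Returns (TF, C) separated magnitude spectrograms.
    """
    logits = V @ A.T                             # (TF, C) bin-to-attractor similarity
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    masks = np.exp(logits)
    masks /= masks.sum(axis=1, keepdims=True)    # soft masks sum to 1 over sources
    return mix_mag[:, None] * masks              # mask the mixture per source
```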
You can set the encoder type by setting `ENCODER_TYPE` in hyperparameters.
You can set the estimator type by setting `TRAIN_ESTIMATOR_METHOD` and `INFER_ESTIMATOR_METHOD` in hyperparameters.
You can set the separator type by setting `SEPARATOR_TYPE` in hyperparameters.
Make sure to use the matching `@register_*` decorator on your class. See the code in `app/modules.py` for details; the existing sub-modules live there.
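As an illustration, a custom encoder might be registered like this. The decorator name `register_encoder` and the `__call__` signature are assumptions patterned after the dataset example above; check `app/modules.py` for the real interface:

```python
import tensorflow as tf

import app.hparams as hparams  # ASSUMPTION: module that provides the register_* decorators
from app.modules import Encoder  # ASSUMPTION: Encoder base class lives in app/modules.py


@hparams.register_encoder('my_encoder')  # ASSUMPTION: decorator name
class MyEncoder(Encoder):
    """Hypothetical encoder: one dense layer over log-magnitude spectra."""

    def __call__(self, s_spectra):
        # ASSUMPTION: maps (batch, time, freq) log-magnitude spectra to
        # (batch, time, freq, D) embeddings, D being the embedding size.
        embed_size = 20
        n_freq = s_spectra.shape[-1].value
        s_out = tf.layers.dense(s_spectra, n_freq * embed_size)
        return tf.reshape(
            s_out, tf.concat([tf.shape(s_spectra), [embed_size]], axis=0))
```

If that were the real interface, you would then set `ENCODER_TYPE` to `"my_encoder"` in your config.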
To change the overall model architecture, modify `Model.build()` in `main.py`.
## Implementation notes

Of the attractor-estimation methods in the paper, only the better-performing "anchor" method is implemented for inference. During training, it's also possible to use the ground truth to derive attractor locations.
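For reference, the ground-truth attractor computation from the paper amounts to a mask-weighted mean of the embeddings. A plain numpy sketch (illustrative, not this repo's exact code):

```python
import numpy as np

def oracle_attractors(V, Y):
    """Ground-truth attractors as in the DANet paper (illustrative sketch).

    V: (TF, D) embedding vectors, one per time-frequency bin.
    Y: (TF, C) ideal binary masks (source membership per bin).
    Returns (C, D) attractors: per-source mean of member embeddings.
    """
    summed = Y.T @ V                          # (C, D) sum of embeddings per source
    counts = Y.sum(axis=0)[:, None]           # (C, 1) number of bins per source
    return summed / np.maximum(counts, 1e-8)  # avoid division by zero
```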
The TIMIT dataset is small, so we use the same set for test and validation.

We use the WSJ0 `si_tr_s` / `si_dt_05` / `si_et_05` subsets as training / validation / test sets respectively. The speakers are randomly chosen and mixed at runtime. This setup is slightly different from the original paper.
Only single-GPU training is implemented.

This code doesn't work on Windows.