tky823 / DNN-based_source_separation

DNN-based source separation

A PyTorch implementation of DNN-based source separation.

Models

| Model | Reference | Done |
| --- | --- | --- |
| WaveNet | WaveNet: A Generative Model for Raw Audio | |
| Wave-U-Net | Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation | |
| Deep Clustering | Deep Clustering: Discriminative Embeddings for Segmentation and Separation | |
| Deep Clustering++ | Single-Channel Multi-Speaker Separation using Deep Clustering | |
| Chimera | Alternative Objective Functions for Deep Clustering | |
| DANet | Deep Attractor Network for Single-microphone Speaker Separation | |
| ADANet | Speaker-independent Speech Separation with Deep Attractor Network | |
| TasNet | TasNet: Time-domain Audio Separation Network for Real-time, Single-channel Speech Separation | |
| Conv-TasNet | Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation | |
| DPRNN-TasNet | Dual-path RNN: Efficient Long Sequence Modeling for Time-domain Single-channel Speech Separation | |
| Gated DPRNN-TasNet | Voice Separation with an Unknown Number of Multiple Speakers | |
| FurcaNet | FurcaNet: An End-to-End Deep Gated Convolutional, Long Short-term Memory, Deep Neural Networks for Single Channel Speech Separation | |
| FurcaNeXt | FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks | |
| DeepCASA | Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation | |
| Conditioned-U-Net | Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for multiple source separations | |
| MMDenseNet | Multi-scale Multi-band DenseNets for Audio Source Separation | |
| MMDenseLSTM | MMDenseLSTM: An Efficient Combination of Convolutional and Recurrent Neural Networks for Audio Source Separation | |
| Open-Unmix (UMX) | Open-Unmix - A Reference Implementation for Music Source Separation | |
| Wavesplit | Wavesplit: End-to-End Speech Separation by Speaker Clustering | |
| Hydranet | Hydranet: A Real-Time Waveform Separation Network | |
| Dual-Path Transformer Network (DPTNet) | Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation | |
| CrossNet-Open-Unmix (X-UMX) | All for One and One for All: Improving Music Separation by Bridging Networks | |
| D3Net | D3Net: Densely connected multidilated DenseNet for music source separation | |
| LaSAFT | LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation | |
| SepFormer | Attention is All You Need in Speech Separation | |
| GALR | Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent networks | |
| HRNet | Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation | |
| MRX | The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks | |

Modules

| Module | Reference | Done |
| --- | --- | --- |
| Depthwise-separable convolution | Xception: Deep Learning with Depthwise Separable Convolutions | |
| Gated Linear Units (GLU) | Language Modeling with Gated Convolutional Networks | |
| Sigmoid Linear Units (SiLU) | Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning | |
| Feature-wise Linear Modulation (FiLM) | FiLM: Visual Reasoning with a General Conditioning Layer | |
| Point-wise Convolutional Modulation (PoCM) | LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation | |
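
For readers unfamiliar with the element-wise modules listed above, here is a minimal sketch of SiLU, GLU and FiLM following their textbook definitions. It uses plain Python lists rather than PyTorch tensors, and the function names are chosen for illustration only; it is not the implementation used in this repository.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x):
    """SiLU: the input scaled by its own sigmoid, x * sigmoid(x)."""
    return x * sigmoid(x)


def glu(values, gates):
    """GLU: each value is multiplied by the sigmoid of its paired gate."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def film(features, gamma, beta):
    """FiLM: per-channel affine modulation, gamma[c] * x + beta[c].

    features is a list of channels, each channel a list of floats.
    """
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


print(silu(0.0))                      # 0.0
print(glu([2.0], [0.0]))              # [1.0], since sigmoid(0) = 0.5
print(film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

In a conditioned separation model, `gamma` and `beta` would typically be predicted from a conditioning input such as the target instrument, rather than given as constants.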

Methods related to training

| Method | Reference | Done |
| --- | --- | --- |
| Permutation invariant training (PIT) | Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks | |
| One-and-rest PIT | Recursive Speech Separation for Unknown Number of Speakers | |
| Probabilistic PIT | Probabilistic Permutation Invariant Training for Speech Separation | |
| Sinkhorn PIT | Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm | |
| Combination Loss | All for One and One for All: Improving Music Separation by Bridging Networks | |
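
As a rough illustration of the basic PIT idea listed above, the sketch below evaluates a pairwise loss for every assignment of estimates to targets and keeps the best one. It is a plain-Python toy that uses mean squared error as the pairwise loss (an assumption made for brevity); the names `mse` and `pit_loss` are illustrative and are not this repository's API.

```python
from itertools import permutations


def mse(estimate, target):
    """Mean squared error between two equal-length signals."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def pit_loss(estimates, targets, pair_loss=mse):
    """Return (best_loss, best_perm) over all estimate-to-target assignments.

    best_perm[i] is the index of the target matched to estimates[i].
    The permutation is chosen once for the whole signal (utterance level).
    """
    n_sources = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n_sources)):
        loss = sum(
            pair_loss(estimates[i], targets[p]) for i, p in enumerate(perm)
        ) / n_sources
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


targets = [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
# The estimates are close to the targets but come out in swapped order.
estimates = [[-0.9, 0.1, 1.1], [1.0, 0.9, 1.0]]

loss, perm = pit_loss(estimates, targets)
print(perm)  # (1, 0): estimate 0 matches target 1, estimate 1 matches target 0
print(loss)
```

The exhaustive search visits all N! permutations, which is cheap for two or three speakers but grows quickly as the number of sources increases.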

Example

LibriSpeech example using Conv-TasNet

Other tutorials are available in <REPOSITORY_ROOT>/egs/tutorials/.

0. Preparation

cd <REPOSITORY_ROOT>/egs/tutorials/common/
. ./prepare_librispeech.sh \
--librispeech_root <LIBRISPEECH_ROOT> \
--n_sources <#SPEAKERS>

1. Training

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./train.sh \
--exp_dir <OUTPUT_DIR>

To resume training from a saved model, pass its path with --continue_from:

. ./train.sh \
--exp_dir <OUTPUT_DIR> \
--continue_from <MODEL_PATH>
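
If you launch runs from Python, for example to sweep over several output directories, the two invocations above can be assembled with a small helper. This is only a sketch: `build_train_command` is a hypothetical name, the paths in the example are placeholders, and only the `--exp_dir` and `--continue_from` options shown above are used.

```python
import shlex


def build_train_command(exp_dir, continue_from=None):
    """Build the shell line that sources train.sh, optionally resuming.

    Arguments are quoted with shlex.quote so that paths containing
    spaces survive the shell.
    """
    parts = [".", "./train.sh", "--exp_dir", shlex.quote(exp_dir)]
    if continue_from is not None:
        parts += ["--continue_from", shlex.quote(continue_from)]
    return " ".join(parts)


# Placeholder paths, not real files from this repository.
print(build_train_command("exp/run1"))
print(build_train_command("exp/run1", continue_from="exp/run1/last.pth"))
```

The resulting string is meant to be run with a shell (for instance via `bash -c`) from the tutorial directory, since the script is sourced with a relative path.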

2. Evaluation

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./test.sh \
--exp_dir <OUTPUT_DIR>

3. Demo

cd <REPOSITORY_ROOT>/egs/tutorials/conv-tasnet/
. ./demo.sh

Pretrained Models

You need gdown to download pretrained models.

pip install gdown

Pretrained models are loaded with build_from_pretrained, for example:

from models.conv_tasnet import ConvTasNet

model = ConvTasNet.build_from_pretrained(task="musdb18", sample_rate=44100, target="vocals")
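
The import above only works when the repository's Python sources are importable. The sketch below wraps the same call in a helper that reports a missing package instead of raising. The helper name, the directory argument, and the assumption that the returned model is a `torch.nn.Module` (so that `eval()` and `parameters()` exist) are illustrative and not confirmed by this README.

```python
import sys


def load_pretrained_vocals_model(repo_src_dir):
    """Return the pretrained Conv-TasNet in eval mode, or None if unavailable.

    repo_src_dir is the directory that contains the `models` package;
    its exact location inside the repository is an assumption here.
    """
    if repo_src_dir not in sys.path:
        sys.path.append(repo_src_dir)
    try:
        from models.conv_tasnet import ConvTasNet
    except ImportError:
        return None
    model = ConvTasNet.build_from_pretrained(
        task="musdb18", sample_rate=44100, target="vocals"
    )
    model.eval()  # assumes a torch.nn.Module; switches to inference mode
    return model


# Placeholder path; replace with the real location of the sources.
model = load_pretrained_vocals_model("path/to/repository/sources")
if model is None:
    print("models.conv_tasnet is not importable from the given directory")
else:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Loaded pretrained model with {n_params} parameters")
```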

See PRETRAINED.md or egs/tutorials/hub/pretrained.ipynb (which can also be opened in Colab) for details.

Time Domain Wrappers for Time-Frequency Domain Models

See egs/tutorials/hub/time-domain_wrapper.ipynb, which can also be opened in Colab.
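
The general idea of such a wrapper can be sketched as follows: transform the waveform to a spectral representation, apply the model there, and transform back. The toy below uses a naive whole-signal DFT from the standard library as a stand-in for the STFT a real wrapper would use, and the class name `TimeDomainWrapper` is hypothetical; see the notebook for the repository's actual wrappers.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


class TimeDomainWrapper:
    """Expose a spectral-domain model through a waveform-in, waveform-out call."""

    def __init__(self, spectral_model, analysis=dft, synthesis=idft):
        self.spectral_model = spectral_model
        self.analysis = analysis
        self.synthesis = synthesis

    def __call__(self, waveform):
        spectrum = self.analysis(waveform)
        estimated = self.spectral_model(spectrum)
        return self.synthesis(estimated)


# Stand-in "model" that halves every spectral coefficient.
wrapped = TimeDomainWrapper(lambda spectrum: [0.5 * c for c in spectrum])
waveform = [1.0, 0.0, -1.0, 0.0]
print(wrapped(waveform))  # approximately [0.5, 0.0, -0.5, 0.0]
```

Because the analysis and synthesis steps live inside the wrapper, callers can treat time-domain and time-frequency models through the same interface.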

Speech Separation by Pretrained Models

See egs/tutorials/hub/speech-separation.ipynb, which can also be opened in Colab.

Music Source Separation by Pretrained Models

See egs/tutorials/hub/music-source-separation.ipynb, which can also be opened in Colab.

If you want to separate your own music file, see below: