raymondxyy / pyaudlib

A speech signal processing library in Python with emphasis on deep learning.
MIT License

Demo structure #10


mahmoudalismail commented 5 years ago

Audlib Demo

Demo is on January 29, 2019 (rescheduled from January 22) in our ROBUST-MLSP group meeting.

Outline

  1. Motivation/Why?
  2. Functionalities
  3. Contributors
  4. Difference between this package and existing packages
    1. Pythonic library
    2. Lazy evaluation
  5. Demo
    1. Easy to use interface
      1. Feature extraction
      2. Data preprocessing (Add multi-threading to audiopipe.py)
        1. HPC
      3. Pytorch compatible dataset
    2. Performance compared to librosa
      1. Optimization (mfcc computation,...etc.)
  6. Roadmap
  7. Contributing

Motivation/What is pyaudlib?

Pyaudlib is a speech processing library in Python with emphasis on deep learning.

Popular speech/audio processing libraries have no deep learning support:

Generic deep learning libraries have good image processing support, but not for audio:

pyaudlib (name subject to change) provides a collection of utilities for developing speech-related applications using both signal processing and deep learning.

Functionalities

pyaudlib offers the following high-level features:

*Under development.

Difference between pyaudlib and existing libraries

  1. Correctness

    • Unit testing is done on all signal processing functions

    • User inputs are checked for correctness

      
      >>> wind = hamming(512, hop=.75, synth=True)
      AssertionError: [wsize:512, hop:0.75] violates COLA in time.
      >>> wind = hamming(512, hop=.5, synth=True)  # ok!
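
    For context, COLA (constant overlap-add) requires the shifted copies of the analysis window to sum to a constant; this is what guarantees perfect reconstruction in overlap-add synthesis, and it is the condition the assertion above enforces. A minimal NumPy check of the two hop choices (a sketch only: check_cola is an illustrative helper, not pyaudlib's API, and scipy's window stands in for pyaudlib's hamming):

      # Illustrative COLA check; not pyaudlib's implementation.
      import numpy as np
      from scipy.signal.windows import hamming

      def check_cola(wsize, hop, tol=1e-10):
          """Return True if Hamming windows shifted by hop*wsize sum to a constant."""
          wind = hamming(wsize, sym=False)  # periodic window, as used for analysis
          hopsize = int(wsize * hop)
          total = np.zeros(5 * wsize)
          for start in range(0, len(total) - wsize + 1, hopsize):
              total[start:start + wsize] += wind
          interior = total[wsize:-wsize]  # only the fully overlapped region must be flat
          return float(np.ptp(interior)) < tol

      print(check_cola(512, 0.5))   # True:  50% hop satisfies COLA
      print(check_cola(512, 0.75))  # False: 75% hop violates it, hence the error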

    • No unexpected output
      
      >>> # Using audlib
      >>> sig, sr = audioread('samples/welcome16k.wav')
      >>> sigspec = stft(sig, sr, wind, .5, 512, synth=True)
      >>> sigsynth = istft(sigspec, sr, wind, .5, 512)
      >>> np.allclose(sig, sigsynth[:len(sig)])
      True

      >>> # Using librosa (you might not expect this)*
      >>> nfft = 512
      >>> sigpad = fix_length(sig, len(sig) + nfft//2)
      >>> D = stft(sigpad, n_fft=nfft)
      >>> sigsynth = istft(D, length=len(sig))
      >>> np.allclose(sig, sigsynth)
      False
      >>> np.sum(np.abs(sig - sigsynth))
      0.00012380785899053157

    
    *This is the [official example](https://librosa.github.io/librosa/generated/librosa.core.istft.html) given by librosa.
  2. Efficiency

    • All functionalities are profiled for running time and memory usage.

    • Frequently used utilities are already on par with popular libraries:

      >>> %timeit stft_audlib(sig, sr, hamming(int(window_length*sr), hop=hopfrac), hopfrac, nfft)
      628 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr))
      757 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit melspec_audlib(sig, sr, wind, hopfrac, nfft, MelFreq(sr, nfft, nmels))
      1.07 ms ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit melspec_librosa(S=np.abs(stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr)))**2, n_mels=nmels)
      1.52 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
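
    The stft_audlib/stft_librosa names above are local benchmark wrappers that are not defined in this issue. As a reference point, the librosa side of such a timing can be reproduced outside IPython with the standard timeit module; in this sketch the test signal and the window_length/hopfrac values are assumptions, not the ones used above:

      # Reproduce a librosa STFT timing with the standard timeit module.
      import timeit
      import numpy as np
      import librosa

      sr = 16000
      nfft = 512
      window_length = 0.032  # 32 ms analysis window (assumed)
      hopfrac = 0.5          # hop as a fraction of the window length (assumed)
      sig = np.random.randn(sr).astype(np.float32)  # 1 s of noise as a stand-in

      win = int(window_length * sr)
      elapsed = timeit.timeit(
          lambda: librosa.stft(sig, n_fft=nfft, hop_length=int(win * hopfrac),
                               win_length=win),
          number=1000)
      print(f"{elapsed / 1000 * 1e6:.0f} µs per loop")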
    • Memory footprint still has room for improvement

      • Memory usage affects speed when processing large amounts of data.
        
        ## For AUDLIB ##
        434476 frames processed by stft_audlib in 36.76 seconds.

        Line #    Mem usage    Increment   Line Contents
        ================================================
            58    145.5 MiB    145.5 MiB   @profile
            59                             def test_transform(transform):
            60                                 """Test time spent for a transform to process a dataset."""
            61    145.5 MiB      0.0 MiB       start_time = time.time()
            62    145.5 MiB      0.0 MiB       numframes = 0
            63    145.5 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
            64    180.8 MiB      1.5 MiB       for ii, samp in enumerate(wsjspeech):
            65    180.8 MiB      0.0 MiB           if not ((ii+1) % 100):
            66    172.3 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
            67    180.8 MiB      8.4 MiB           feat = transform(wsjspeech[ii])
            68    180.8 MiB      0.0 MiB           numframes += feat.shape[idx]
            69    180.8 MiB      0.0 MiB           if (ii+1) > 500:
            70    171.0 MiB      0.0 MiB               break
            71    171.0 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time() - start_time:.2f} seconds.""")

        ## For LIBROSA ##
        434479 frames processed by stft_librosa in 36.07 seconds.

        Line #    Mem usage    Increment   Line Contents
        ================================================
            58    148.6 MiB    148.6 MiB   @profile
            59                             def test_transform(transform):
            60                                 """Test time spent for a transform to process a dataset."""
            61    148.6 MiB      0.0 MiB       start_time = time.time()
            62    148.6 MiB      0.0 MiB       numframes = 0
            63    148.6 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
            64    166.4 MiB      1.0 MiB       for ii, samp in enumerate(wsjspeech):
            65    166.4 MiB      0.0 MiB           if not ((ii+1) % 100):
            66    164.8 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
            67    166.4 MiB      7.0 MiB           feat = transform(wsjspeech[ii])
            68    166.4 MiB      0.0 MiB           numframes += feat.shape[idx]
            69    166.4 MiB      0.0 MiB           if (ii+1) > 500:
            70    164.1 MiB      0.0 MiB               break
            71    164.1 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time() - start_time:.2f} seconds.""")
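
    The traces above are line-by-line memory profiles, apparently from memory_profiler's @profile decorator (an assumption; the tool is not named in this issue). A minimal, self-contained way to produce such a trace:

      # Sketch: line-by-line memory profiling with memory_profiler
      # (pip install memory-profiler). toy_transform is a stand-in for
      # test_transform, just to generate a trace like the ones above.
      import numpy as np
      from memory_profiler import profile

      @profile
      def toy_transform():
          """Allocate and process an array so the profiler shows increments."""
          sig = np.random.randn(1_000_000)      # ~8 MiB increment
          spec = np.abs(np.fft.rfft(sig)) ** 2  # another sizeable increment
          return spec.sum()

      if __name__ == "__main__":
          toy_transform()  # run: python profile_demo.py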

    
    - A note on programming patterns in our group
    
    We have seen three patterns for pre-processing audio data before feeding them into a NN:
    1. Extract all features and save to a file, then load them all at once when needed.
        - Maximum disk space
        - Unacceptable usage of memory
        - Extremely slow runtime
    2. Extract and save each feature to a separate file, then load them when needed.
        - Maximum disk space
        - Minimal memory footprint, given that features are loaded on-demand
        - Fastest runtime 
    3. Extract features on-the-fly (see the sketch after this list).
        - No disk space
        - Very small memory footprint
        - Moderate runtime
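
    Pattern 3 is the one pyaudlib's dataset utilities target. A minimal sketch of the idea with a PyTorch map-style dataset (the class name, file list, and transform below are illustrative, not pyaudlib's API):

      # Pattern 3 sketch: features are computed on demand in __getitem__,
      # so nothing is cached to disk and memory holds only the current item.
      import soundfile as sf
      import torch
      from torch.utils.data import Dataset

      class OnTheFlySpeech(Dataset):
          def __init__(self, filepaths, transform):
              self.filepaths = filepaths  # paths to audio files
              self.transform = transform  # callable: (waveform, sr) -> features

          def __len__(self):
              return len(self.filepaths)

          def __getitem__(self, idx):
              sig, sr = sf.read(self.filepaths[idx])
              return torch.as_tensor(self.transform(sig, sr))

    Served through a DataLoader with num_workers > 0, per-item extraction overlaps with training, which keeps the runtime moderate even though features are recomputed every epoch.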
  3. Simplicity

    • Syntax is simple but does not over-simplify
    • Dataset creation follows the torchvision style
    • When in doubt, there are example IPython notebooks to reference
  4. Continuous development (for developers)

    • Codebase is written and documented according to industry standards (PEP 8, the NumPy docstring guide)
    • Continuous integration (MAHMOUD: Add something here)
    • No high-level dependencies. Credible low-level dependencies are included when absolutely required:
      • PyTorch for DNN implementations and GPU calculation
      • NumPy for multi-dimensional array computation
      • Click for command-line interface
      • SoundFile for audio I/O
      • resampy for resampling*
      • SciPy for filtering*
      • Matplotlib for plotting

    *Will be removed in the future.

Roadmap

Top-priority stack (before March):

Mid-priority stack (before April):

Other ideas:

Contributing

Current contributors (everyone who has pushed to the repo at least once):

Raymond Xia - yangyanx@andrew.cmu.edu

Mahmoud Alismail - mahmoudi@andrew.cmu.edu

Shangwu Yao - shangwuyao@gmail.com

Joining the development team, reporting issues, and requesting features are all welcome!