Demo is on January 29, 2019 (originally January 22) in our ROBUST-MLSP group meeting.
Outline
Motivation/Why?
Functionalities
Contributors
Difference between this package and existing packages
Pythonic library
Lazy evaluation
Demo
Easy to use interface
Feature extraction
Data preprocessing (Add multi-threading to audiopipe.py)
HPC
PyTorch-compatible dataset
Performance compared to librosa
Optimization (MFCC computation, etc.)
Roadmap
Contributing
Motivation/What is pyaudlib?
Pyaudlib is a speech processing library in Python with emphasis on deep learning.
Popular speech/audio processing libraries have no deep learning support:
librosa
voicebox
...
Generic deep learning libraries have good image processing support, but not for audio:
PyTorch
TensorFlow
...
pyaudlib (name subject to change) provides a collection of utilities for developing speech-related applications using both signal processing and deep learning.
Functionalities
pyaudlib offers the following high-level features:
Speech signal processing utilities with ready-to-use applications
Feature extraction frontend
Speech enhancement
Speech activity detection
Deep learning architectures for speech processing tasks in PyTorch
SNRNN (and its variant) for speech enhancement*
Attention network + CTC objective for speech recognition*
PyTorch-compatible interface (similar to torchvision) for batch processing
Memory usage affects speed when processing large amounts of data.
## For AUDLIB ##
434476 frames processed by stft_audlib in 36.76 seconds.
Line #    Mem usage    Increment   Line Contents
    58    145.5 MiB    145.5 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    145.5 MiB      0.0 MiB       start_time = time.time()
    62    145.5 MiB      0.0 MiB       numframes = 0
    63    145.5 MiB      0.0 MiB       idx = 0 if transform.name.endswith('audlib') else 1
    64    180.8 MiB      1.5 MiB       for ii, samp in enumerate(wsjspeech):
    65    180.8 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    172.3 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    180.8 MiB      8.4 MiB           feat = transform(wsjspeech[ii])
    68    180.8 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    180.8 MiB      0.0 MiB           if (ii+1) > 500:
    70    171.0 MiB      0.0 MiB               break
    71    171.0 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.name} in {time.time()-start_time:.2f} seconds.""")
For LIBROSA
434479 frames processed by stft_librosa in 36.07 seconds.
Line #    Mem usage    Increment   Line Contents
    58    148.6 MiB    148.6 MiB   @profile
    59                             def test_transform(transform):
    60                                 """Test time spent for a transform to process a dataset."""
    61    148.6 MiB      0.0 MiB       start_time = time.time()
    62    148.6 MiB      0.0 MiB       numframes = 0
    63    148.6 MiB      0.0 MiB       idx = 0 if transform.name.endswith('audlib') else 1
    64    166.4 MiB      1.0 MiB       for ii, samp in enumerate(wsjspeech):
    65    166.4 MiB      0.0 MiB           if not ((ii+1) % 100):
    66    164.8 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
    67    166.4 MiB      7.0 MiB           feat = transform(wsjspeech[ii])
    68    166.4 MiB      0.0 MiB           numframes += feat.shape[idx]
    69    166.4 MiB      0.0 MiB           if (ii+1) > 500:
    70    164.1 MiB      0.0 MiB               break
    71    164.1 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.name} in {time.time()-start_time:.2f} seconds.""")
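The measurements above come from a line-by-line memory profiler applied to a timing loop. For readers who want a quick comparison without extra dependencies, here is a minimal sketch that uses only the standard library (`time` and `tracemalloc`). It is not the script used for the numbers above: `profile_transform` and `toy_transform` are hypothetical names, the toy transform stands in for an STFT, and `tracemalloc` reports memory allocated by Python rather than total process memory, so its figures will not match the MiB values shown.

```python
import time
import tracemalloc


def profile_transform(transform, dataset, max_files=500):
    """Return (frames processed, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start_time = time.perf_counter()
    numframes = 0
    for ii, sample in enumerate(dataset):
        if ii >= max_files:  # cap the number of files, as the loop above does
            break
        feat = transform(sample)
        numframes += len(feat)  # one entry per frame in this toy setup
    elapsed = time.perf_counter() - start_time
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return numframes, elapsed, peak


def toy_transform(signal, frame_len=4):
    """Hypothetical stand-in for an STFT: mean magnitude of each frame."""
    return [sum(abs(x) for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


numframes, elapsed, peak = profile_transform(toy_transform, [[0.5] * 16] * 10)
print(f"{numframes} frames in {elapsed:.4f} seconds, peak {peak / 2**20:.3f} MiB")
```

Running the same function once per library-specific transform gives a like-for-like comparison, provided both transforms see the same files in the same order.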
A note on programming patterns in our group
We have seen three patterns for pre-processing audio data before feeding it into a neural network:
1. Extract all features and save to a file, then load them all at once when needed.
- Maximum disk space
- Unacceptable usage of memory
- Extremely slow runtime
2. Extract and save each feature to a separate file, then load them when needed.
- Maximum disk space
- Minimal memory footprint, given that features are loaded on-demand
- Fastest runtime
3. Extract features on-the-fly.
- No disk space
- Very small memory footprint
- Moderate runtime
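Patterns 2 and 3 can be sketched side by side as follows. This is an illustration under stated assumptions, not pyaudlib code: `extract_feature`, `CachedFeatures`, and `OnTheFlyFeatures` are hypothetical names, the feature is a toy frame energy rather than a real STFT or MFCC, and the signals are held in memory, whereas a real pipeline would read audio files from disk.

```python
import os
import pickle
import tempfile


def extract_feature(signal, frame_len=4):
    """Toy feature (placeholder for STFT, MFCC, ...): energy of each frame."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


class CachedFeatures:
    """Pattern 2: write one feature file per signal, load each on demand."""

    def __init__(self, signals, cache_dir):
        self.paths = []
        for idx, signal in enumerate(signals):
            path = os.path.join(cache_dir, f"feat_{idx}.pkl")
            with open(path, "wb") as fp:
                pickle.dump(extract_feature(signal), fp)
            self.paths.append(path)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Only the requested feature is read into memory.
        with open(self.paths[idx], "rb") as fp:
            return pickle.load(fp)


class OnTheFlyFeatures:
    """Pattern 3: store nothing extra, compute each feature when indexed."""

    def __init__(self, signals):
        self.signals = signals

    def __len__(self):
        return len(self.signals)

    def __getitem__(self, idx):
        return extract_feature(self.signals[idx])


signals = [[1.0] * 8, [2.0] * 12]
with tempfile.TemporaryDirectory() as cache_dir:
    cached = CachedFeatures(signals, cache_dir)
    on_the_fly = OnTheFlyFeatures(signals)
    print(cached[1] == on_the_fly[1])  # both patterns yield the same feature
```

Both classes expose the same indexing interface, so a training loop does not need to know which pattern is in use; the trade-off is disk space (pattern 2) against repeated computation (pattern 3).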
Simplicity
Syntax is simple but does not over-simplify
Dataset creation follows the torchvision style
When in doubt, there are example IPython notebooks to reference
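To show what "torchvision style" means in practice, here is a minimal sketch: a dataset takes an optional `transform` callable, and several transforms are chained into one. Every name below (`Compose`, `ToySpeechDataset`, `remove_dc`, `peak_normalize`) is hypothetical and written in plain Python; it mirrors the idea behind `torchvision.transforms.Compose` and a map-style `torch.utils.data.Dataset` (which needs `__len__` and `__getitem__`), and is not the pyaudlib interface.

```python
class Compose:
    """Chain transforms left to right into a single callable."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample


class ToySpeechDataset:
    """Map-style dataset: indexable, with an optional per-sample transform."""

    def __init__(self, signals, transform=None):
        self.signals = signals
        self.transform = transform

    def __len__(self):
        return len(self.signals)

    def __getitem__(self, idx):
        sample = self.signals[idx]
        return self.transform(sample) if self.transform else sample


def remove_dc(signal):
    """Subtract the mean so the signal is centered at zero."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


def peak_normalize(signal):
    """Scale so the largest magnitude is 1 (silent signals are returned as-is)."""
    peak = max(abs(x) for x in signal)
    return [x / peak for x in signal] if peak else list(signal)


dataset = ToySpeechDataset([[1.0, 3.0], [2.0, 6.0]],
                           transform=Compose([remove_dc, peak_normalize]))
print(dataset[0])
```

Because the dataset only relies on indexing and length, an object shaped like this could be handed to a batching loader; swapping the feature pipeline means changing the `transform` argument only.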
Audlib Demo
Dataset class specific to speech tasks
*Under development.
Difference between pyaudlib and existing libraries
Correctness
Unit testing is done on all signal processing functions
User inputs are checked for correctness
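As an illustration of the two points above, here is a small sketch of a signal processing helper that validates its inputs, together with unit tests for it. `frame_signal` and its parameters are hypothetical and chosen for the example; they are not taken from the pyaudlib code base.

```python
import unittest


def frame_signal(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples, advancing by hop."""
    if not isinstance(frame_len, int) or not isinstance(hop, int):
        raise TypeError("frame_len and hop must be integers")
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    return [list(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]


class FrameSignalTest(unittest.TestCase):
    def test_frames_overlap_as_expected(self):
        frames = frame_signal([0, 1, 2, 3, 4], frame_len=3, hop=1)
        self.assertEqual(frames, [[0, 1, 2], [1, 2, 3], [2, 3, 4]])

    def test_rejects_non_positive_hop(self):
        with self.assertRaises(ValueError):
            frame_signal([0, 1, 2], frame_len=2, hop=0)

    def test_rejects_non_integer_frame_len(self):
        with self.assertRaises(TypeError):
            frame_signal([0, 1, 2], frame_len=1.5, hop=1)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FrameSignalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Failing early with a specific exception keeps a bad hop size or frame length from silently producing an empty or misaligned feature matrix further down the pipeline.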
Efficiency
All functionalities are profiled in terms of time and space complexity.
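Frequently used utilities already run about as fast as popular libraries (STFT over roughly 434,000 frames: 36.76 seconds for audlib versus 36.07 seconds for librosa)
Memory footprint still has room to improve (peak of 180.8 MiB for audlib versus 166.4 MiB for librosa in the same test)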
Simplicity
Continuous development (for developers)
*Will be removed in the future.
Roadmap
Top-priority stack (before March):
Mid-priority stack (before April):
Other ideas:
Contributing
Current contributors (pushed to the repo at least once):
Raymond Xia - yangyanx@andrew.cmu.edu
Mahmoud Alismail - mahmoudi@andrew.cmu.edu
Shangwu Yao - shangwuyao@gmail.com
Joining the development team, reporting issues, and requesting features are all welcome!