raymondxyy / pyaudlib

A speech signal processing library in Python with emphasis on deep learning.
MIT License

Demo structure #10


mahmoudalismail commented 5 years ago

Audlib Demo

Demo is on January 29, 2019 (rescheduled from January 22) in our ROBUST-MLSP group meeting.

Outline

  1. Motivation/Why?
  2. Functionalities
  3. Contributors
  4. Difference between this package and existing packages
    1. Pythonic library
    2. Lazy evaluation
  5. Demo
    1. Easy to use interface
      1. Feature extraction
      2. Data preprocessing (Add multi-threading to audiopipe.py)
        1. HPC
      3. Pytorch compatible dataset
    2. Performance compared to librosa
      1. Optimization (mfcc computation,...etc.)
  6. Roadmap
  7. Contributing

Motivation/What is pyaudlib?

Pyaudlib is a speech processing library in Python with emphasis on deep learning.

Popular speech/audio processing libraries have no deep learning support:

Generic deep learning libraries have good image processing support, but not for audio:

pyaudlib (name subject to change) provides a collection of utilities for developing speech-related applications using both signal processing and deep learning.

Functionalities

pyaudlib offers the following high-level features:

*Under development.

Difference between pyaudlib and existing libraries

  1. Correctness

    • Unit testing is done on all signal processing functions

    • User inputs are checked for correctness

      
      >>> wind = hamming(512, hop=.75, synth=True)
      AssertionError: [wsize:512, hop:0.75] violates COLA in time.
      >>> wind = hamming(512, hop=.5, synth=True)  # ok!
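
    For context, COLA (constant overlap-add) requires the shifted copies of the analysis window to sum to a constant; this is what guarantees perfect reconstruction in overlap-add synthesis, and it is the condition the assertion above enforces. A minimal NumPy check of the two hop choices (a sketch only: check_cola is an illustrative helper, not pyaudlib's API, and scipy's window stands in for pyaudlib's hamming):

      # Illustrative COLA check; not pyaudlib's implementation.
      import numpy as np
      from scipy.signal.windows import hamming

      def check_cola(wsize, hop, tol=1e-10):
          """Return True if Hamming windows shifted by hop*wsize sum to a constant."""
          wind = hamming(wsize, sym=False)  # periodic window, as used for analysis
          hopsize = int(wsize * hop)
          total = np.zeros(5 * wsize)
          for start in range(0, len(total) - wsize + 1, hopsize):
              total[start:start + wsize] += wind
          interior = total[wsize:-wsize]  # only the fully overlapped region must be flat
          return float(np.ptp(interior)) < tol

      print(check_cola(512, 0.5))   # True:  50% hop satisfies COLA
      print(check_cola(512, 0.75))  # False: 75% hop violates it, hence the error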

    • No unexpected output
      
      >>> # Using audlib
      >>> sig, sr = audioread('samples/welcome16k.wav')
      >>> sigspec = stft(sig, sr, wind, .5, 512, synth=True)
      >>> sigsynth = istft(sigspec, sr, wind, .5, 512)
      >>> np.allclose(sig, sigsynth[:len(sig)])
      True

      >>> # Using librosa (you might not expect this)*
      >>> nfft = 512
      >>> sigpad = fix_length(sig, len(sig) + nfft//2)
      >>> D = stft(sigpad, n_fft=nfft)
      >>> sigsynth = istft(D, length=len(sig))
      >>> np.allclose(sig, sigsynth)
      False
      >>> np.sum(np.abs(sig - sigsynth))
      0.00012380785899053157

    
    *This is the [official example](https://librosa.github.io/librosa/generated/librosa.core.istft.html) given by librosa.
  2. Efficiency

    • All functionalities are profiled for running time and memory usage.

    • Frequently used utilities are already on par with popular libraries:

      >>> %timeit stft_audlib(sig, sr, hamming(int(window_length*sr), hop=hopfrac), hopfrac, nfft)
      628 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr))
      757 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit melspec_audlib(sig, sr, wind, hopfrac, nfft, MelFreq(sr, nfft, nmels))
      1.07 ms ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
      >>> %timeit melspec_librosa(S=np.abs(stft_librosa(sig, n_fft=nfft, hop_length=int(window_length*sr*hopfrac), win_length=int(window_length*sr)))**2, n_mels=nmels)
      1.52 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
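
    The stft_audlib/stft_librosa names above are local benchmark wrappers that are not defined in this issue. As a reference point, the librosa side of such a timing can be reproduced outside IPython with the standard timeit module; in this sketch the test signal and the window_length/hopfrac values are assumptions, not the ones used above:

      # Reproduce a librosa STFT timing with the standard timeit module.
      import timeit
      import numpy as np
      import librosa

      sr = 16000
      nfft = 512
      window_length = 0.032  # 32 ms analysis window (assumed)
      hopfrac = 0.5          # hop as a fraction of the window length (assumed)
      sig = np.random.randn(sr).astype(np.float32)  # 1 s of noise as a stand-in

      win = int(window_length * sr)
      elapsed = timeit.timeit(
          lambda: librosa.stft(sig, n_fft=nfft, hop_length=int(win * hopfrac),
                               win_length=win),
          number=1000)
      print(f"{elapsed / 1000 * 1e6:.0f} µs per loop")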
    • Memory footprint still has room for improvement

      • Memory usage affects speed when processing large amounts of data.
        
        ## For AUDLIB ##
        434476 frames processed by stft_audlib in 36.76 seconds.

        Line #    Mem usage    Increment   Line Contents
        ================================================
            58    145.5 MiB    145.5 MiB   @profile
            59                             def test_transform(transform):
            60                                 """Test time spent for a transform to process a dataset."""
            61    145.5 MiB      0.0 MiB       start_time = time.time()
            62    145.5 MiB      0.0 MiB       numframes = 0
            63    145.5 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
            64    180.8 MiB      1.5 MiB       for ii, samp in enumerate(wsjspeech):
            65    180.8 MiB      0.0 MiB           if not ((ii+1) % 100):
            66    172.3 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
            67    180.8 MiB      8.4 MiB           feat = transform(wsjspeech[ii])
            68    180.8 MiB      0.0 MiB           numframes += feat.shape[idx]
            69    180.8 MiB      0.0 MiB           if (ii+1) > 500:
            70    171.0 MiB      0.0 MiB               break
            71    171.0 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time() - start_time:.2f} seconds.""")

        ## For LIBROSA ##
        434479 frames processed by stft_librosa in 36.07 seconds.

        Line #    Mem usage    Increment   Line Contents
        ================================================
            58    148.6 MiB    148.6 MiB   @profile
            59                             def test_transform(transform):
            60                                 """Test time spent for a transform to process a dataset."""
            61    148.6 MiB      0.0 MiB       start_time = time.time()
            62    148.6 MiB      0.0 MiB       numframes = 0
            63    148.6 MiB      0.0 MiB       idx = 0 if transform.__name__.endswith('audlib') else 1
            64    166.4 MiB      1.0 MiB       for ii, samp in enumerate(wsjspeech):
            65    166.4 MiB      0.0 MiB           if not ((ii+1) % 100):
            66    164.8 MiB      0.0 MiB               print(f"Processing [{ii+1}/{len(wsjspeech)}] files.")
            67    166.4 MiB      7.0 MiB           feat = transform(wsjspeech[ii])
            68    166.4 MiB      0.0 MiB           numframes += feat.shape[idx]
            69    166.4 MiB      0.0 MiB           if (ii+1) > 500:
            70    164.1 MiB      0.0 MiB               break
            71    164.1 MiB      0.0 MiB       print(f"""{numframes} frames processed by {transform.__name__} in {time.time() - start_time:.2f} seconds.""")
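
    The traces above are line-by-line memory profiles, apparently from memory_profiler's @profile decorator (an assumption; the tool is not named in this issue). A minimal, self-contained way to produce such a trace:

      # Sketch: line-by-line memory profiling with memory_profiler
      # (pip install memory-profiler). toy_transform is a stand-in for
      # test_transform, just to generate a trace like the ones above.
      import numpy as np
      from memory_profiler import profile

      @profile
      def toy_transform():
          """Allocate and process an array so the profiler shows increments."""
          sig = np.random.randn(1_000_000)      # ~8 MiB increment
          spec = np.abs(np.fft.rfft(sig)) ** 2  # another sizeable increment
          return spec.sum()

      if __name__ == "__main__":
          toy_transform()  # run: python profile_demo.py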

    
    - A note on programming patterns in our group
    
    We have seen three patterns for pre-processing audio data before feeding them into a NN:
    1. Extract all features and save to a file, then load them all at once when needed.
        - Maximum disk space
        - Unacceptable usage of memory
        - Extremely slow runtime
    2. Extract and save each feature to a separate file, then load them when needed.
        - Maximum disk space
        - Minimal memory footprint, given that features are loaded on-demand
        - Fastest runtime 
    3. Extract features on-the-fly (see the sketch after this list).
        - No disk space
        - Very small memory footprint
        - Moderate runtime
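
    Pattern 3 is the one pyaudlib's dataset utilities target. A minimal sketch of the idea with a PyTorch map-style dataset (the class name, file list, and transform below are illustrative, not pyaudlib's API):

      # Pattern 3 sketch: features are computed on demand in __getitem__,
      # so nothing is cached to disk and memory holds only the current item.
      import soundfile as sf
      import torch
      from torch.utils.data import Dataset

      class OnTheFlySpeech(Dataset):
          def __init__(self, filepaths, transform):
              self.filepaths = filepaths  # paths to audio files
              self.transform = transform  # callable: (waveform, sr) -> features

          def __len__(self):
              return len(self.filepaths)

          def __getitem__(self, idx):
              sig, sr = sf.read(self.filepaths[idx])
              return torch.as_tensor(self.transform(sig, sr))

    Served through a DataLoader with num_workers > 0, per-item extraction overlaps with training, which keeps the runtime moderate even though features are recomputed every epoch.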
  3. Simplicity

    • Syntax is simple but does not over-simplify
    • Dataset creation follows the torchvision style
    • When in doubt, there are example IPython notebooks to reference
  4. Continuous development (for developers)

    • Codebase is written and documented according to industry standards (PEP 8, the NumPy docstring guide)
    • Continuous integration (MAHMOUD: Add something here)
    • No high-level dependencies. Credible low-level dependencies are included when absolutely required:
      • PyTorch for DNN implementations and GPU calculation
      • NumPy for multi-dimensional array computation
      • Click for command-line interface
      • SoundFile for audio I/O
      • resampy for resampling*
      • SciPy for filtering*
      • Matplotlib for plotting

    *Will be removed in the future.

Roadmap

Top-priority stack (before March):

Mid-priority stack (before April):

Other ideas:

Contributing

Current contributors (everyone who has pushed to the repo at least once):

Raymond Xia - yangyanx@andrew.cmu.edu

Mahmoud Alismail - mahmoudi@andrew.cmu.edu

Shangwu Yao - shangwuyao@gmail.com

Joining the development team, reporting issues, and requesting features are all welcome!