ubclaunchpad / minutes

:telescope: Speaker diarization via transfer learning
https://medium.com/ubc-launch-pad-software-engineering-blog/speaker-diarisation-using-transfer-learning-47ca1a1226f4

Chad/#103 transition minutes from api to library #107

Closed chadlagore closed 6 years ago

chadlagore commented 6 years ago

Handles #103, #108, #111.

:construction_worker: Changes

Typical Usage

from minutes import Speaker, Minutes

minutes = Minutes(ms_per_observation=500, model='cnn')

# Create some speakers with some audio.
speaker1 = Speaker('speaker1')
speaker1.add_audio('path/to/audio1.wav')

speaker2 = Speaker('speaker2')
speaker2.add_audio('path/to/audio2.wav')

# Add speakers to the model.
minutes.add_speakers([speaker1, speaker2])

# Fit the model.
minutes.fit()  # Currently breaks (have to refit base model).
result = minutes.predict()

Rebuilding the Base Model With New Speakers

(Bring Your Own GPU)

from minutes import Speaker, Minutes
from minutes.base import BaseModel

model = BaseModel('my_cnn_base', ms_per_observation=500)

# Create some speakers with some (large) audio.
speaker1 = Speaker('speaker1')
speaker1.add_audio('path/to/audio1.wav')

speaker2 = Speaker('speaker2')
speaker2.add_audio('path/to/audio2.wav')

# Add speakers to the base model.
model.add_speaker(speaker1)
model.add_speaker(speaker2)

# Fit the model.
model.fit()  # Prints validation results....
model.save()

# Use the new base model.
minutes = Minutes(model='my_cnn_base')
# ... add speakers, predict etc.

:flashlight: Testing Instructions

For now,

$ py.test -vvv --cov=minutes test

Let's keep the coverage up!
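
As an example, a minimal smoke test in that suite might look like this (the test body is a hypothetical sketch; it only exercises the Speaker API shown above against one of the checked-in fixtures):

from minutes import Speaker

def test_speaker_add_audio():
    # Hypothetical smoke test: constructing a speaker and attaching
    # a checked-in fixture should not raise.
    speaker = Speaker('sample1')
    speaker.add_audio('test/fixtures/sample1.wav')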

Repo Layout

.
├── README.md
├── bld.bat
├── build.sh
├── environment.yml
├── meta.yaml
├── minutes
│   ├── __init__.py
│   ├── audio.py
│   ├── base.py
│   ├── conversation.py
│   ├── minutes.py
│   ├── models
│   │   ├── __init__.py
│   │   └── cnn.h5
│   └── speaker.py
├── setup.py
└── test
    ├── __init__.py
    ├── config.py
    ├── fixtures
    │   ├── sample1.wav
    │   └── sample2.wav
    ├── test_audio.py
    ├── test_base.py
    ├── test_minutes.py
    └── test_speaker.py

Base Model(s)

base_model = BaseModel('taco', ms_per_observation=3000)
speaker1 = Speaker('4640')
speaker2 = Speaker('8098')
speaker3 = Speaker('441')

speaker1.add_audio('test/fixtures/4640')
speaker2.add_audio('test/fixtures/8098')
speaker3.add_audio('test/fixtures/441')

base_model.add_speaker(speaker1)
base_model.add_speaker(speaker2)
base_model.add_speaker(speaker3)

base_model.fit(verbose=2)
...
Epoch 45/50
 - 3s - loss: 0.6708 - acc: 0.8345 - val_loss: 0.6273 - val_acc: 0.9076
Epoch 46/50
 - 3s - loss: 0.6404 - acc: 0.8414 - val_loss: 0.6115 - val_acc: 0.9076
Epoch 47/50
 - 3s - loss: 0.6175 - acc: 0.8454 - val_loss: 0.5964 - val_acc: 0.9076
Epoch 48/50
 - 3s - loss: 0.5998 - acc: 0.8553 - val_loss: 0.5815 - val_acc: 0.9076
Epoch 49/50
 - 3s - loss: 0.6101 - acc: 0.8583 - val_loss: 0.5678 - val_acc: 0.9056
Epoch 50/50
 - 3s - loss: 0.5849 - acc: 0.8632 - val_loss: 0.5533 - val_acc: 0.9137
base_model.model.save('minutes/models/cnn.h5')
base_model.model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_3 (Conv1D)            (None, 49, 32)            219168    
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 24, 32)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 24, 32)            0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 768)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 128)               98432     
_________________________________________________________________
dense_22 (Dense)             (None, 3)                 387       
=================================================================
Total params: 317,987
Trainable params: 317,987
Non-trainable params: 0
_________________________________________________________________
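
For reference, a Keras sketch that reproduces the layer stack in the summary above. The input shape (112, 107), kernel size 64, dropout rate, activations, and compile settings are reverse-engineered assumptions, chosen only because they reproduce the listed output shapes and parameter counts; the real dimensions come from the preprocessing code.

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

model = Sequential([
    # (64 * 107 + 1) * 32 = 219,168 params; output shape (49, 32).
    Conv1D(32, kernel_size=64, activation='relu', input_shape=(112, 107)),
    MaxPooling1D(pool_size=2),        # -> (24, 32)
    Dropout(0.5),                     # assumed rate
    Flatten(),                        # -> (768,)
    Dense(128, activation='relu'),    # 768 * 128 + 128 = 98,432 params
    Dense(3, activation='softmax'),   # one unit per speaker; 387 params
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])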

FLAC-to-WAV Conversion

Rather crudely, I did something like:

import glob
import os
from subprocess import call

flac_files = glob.glob('test/fixtures/5561' + '/**/*.flac', recursive=True)

for file in flac_files:
    # os.path.splitext swaps the extension safely; str.strip('.flac')
    # would eat matching characters from both ends of the path.
    call(["ffmpeg", "-i", file, os.path.splitext(file)[0] + '.wav'])

This is obviously not an option for the library. We should find a way to read in .flac files properly (#110).
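
One way to avoid the ffmpeg round-trip entirely (an assumption on my part, not necessarily what #110 lands on): the soundfile package reads FLAC natively via libsndfile.

import soundfile as sf

# Read FLAC directly into a NumPy array plus sample rate;
# 'path/to/audio.flac' is a placeholder path.
data, rate = sf.read('path/to/audio.flac')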

Other Notes

Epoch 12/15
 - 1s - loss: 0.7596 - acc: 0.8963 - val_loss: 0.7163 - val_acc: 0.9102
Epoch 13/15
 - 1s - loss: 0.7092 - acc: 0.8872 - val_loss: 0.6522 - val_acc: 0.9224
Epoch 14/15
 - 1s - loss: 0.6473 - acc: 0.8943 - val_loss: 0.5936 - val_acc: 0.9163
Epoch 15/15
 - 2s - loss: 0.5940 - acc: 0.9023 - val_loss: 0.5443 - val_acc: 0.9286
iKevinY commented 6 years ago

What code is generating the two spectrograms that are in your PR description? I'm curious why so much of the second one is purple (what information is it actually encoding, compared to the first?) 😮

chadlagore commented 6 years ago

That's scipy.signal.spectrogram (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html); mode="phase" produces the first spectrogram. I agree that it's rather surprising that the purple spectrogram learns at all. #114 will let users configure this a bit more.
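
For anyone reproducing the figures, a minimal sketch comparing the default PSD mode against mode="phase" (the sine input and parameters here are placeholders, not the audio or settings used in the PR):

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

fs = 16000                        # assumed sample rate
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)   # stand-in for real speech

fig, axes = plt.subplots(2, 1, sharex=True)
for ax, mode in zip(axes, ['psd', 'phase']):
    f, seg_t, Sxx = spectrogram(x, fs=fs, mode=mode)
    ax.pcolormesh(seg_t, f, Sxx)
    ax.set_title(f"mode='{mode}'")
    ax.set_ylabel('Hz')
axes[-1].set_xlabel('s')
plt.tight_layout()
plt.show()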