taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0

Using speaker diarization on mobile devices #24

Open · MatthewWaller opened this issue 4 years ago

MatthewWaller commented 4 years ago

Hello!

Thank you @taylorlu for all your work here, first off.

I'm working to get a handle on speaker diarization, and wanted to know if you had an idea of what might be involved in getting this system to work on mobile.

Assuming I could get the actual PyTorch model successfully loaded on a mobile device, either by using PyTorch's SDK for that directly or by converting to Core ML on iOS or some such, what kind of audio preprocessing is needed when feeding each buffer of audio to the model?

If the question is too broad or far-reaching, please let me know of any resources I might look at to gain some perspective, or any good spots in the code to examine what the preprocessing looks like for inference.

Thanks!

taylorlu commented 4 years ago

So the question is how to handle the MFCC processing for source PCM/WAV data stored in memory.

There are many open-source frameworks written in C++, such as Essentia and SoX, and since MFCC is such a commonly used feature, it's also not hard to write one yourself or find an implementation on GitHub.

Then you should use the AVFoundation framework or OpenAL on iOS to handle the real-time audio queue, and pull the data into another thread to do the diarization.

I don't think it's cumbersome work, but it will take time to debug.
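
For concreteness, here is a minimal sketch of that MFCC step in Python using librosa; a mobile port would reimplement the same computation in C++ or Swift. The 25 ms window / 10 ms hop are typical speech settings, not necessarily this repo's exact parameters:

```python
import numpy as np
import librosa

def mfcc_from_pcm(pcm, sr=16000, n_mfcc=13):
    """Compute MFCCs from a mono float32 PCM buffer already in memory.

    Frame settings (25 ms window, 10 ms hop) are illustrative, typical
    for speech, and not necessarily what this repo uses.
    """
    return librosa.feature.mfcc(
        y=pcm, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window
        hop_length=int(0.010 * sr),  # 10 ms hop between frames
    )

# e.g. one second of audio -> array of shape (13, ~101 frames)
feats = mfcc_from_pcm(np.zeros(16000, dtype=np.float32))
```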

MatthewWaller commented 4 years ago

Ah! So you're saying that the inputs are basically MFCCs? I actually wrote an MFCC calculator in Swift in the style of python_speech_features. So should I be able to take, say, a 10 ms buffer of audio (or more), and use that as input to the model to get a prediction? I'm thinking of getting predictions while the audio is streaming (asynchronously, of course, but still).

taylorlu commented 4 years ago

Yes, have a look at https://github.com/taylorlu/Speaker-Diarization/blob/403783173239a33bfd4de0774921ba9479413641/speakerDiarization.py#L105. The code uses librosa to extract the audio features, and the sliding window lets you accurately calculate the time point of each window.
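
A hedged sketch of that sliding-window idea: compute a spectrogram once, slice fixed-width windows with a hop, and recover each window's start time from its frame index. The window and hop sizes here are illustrative; the repo derives its own from parameters in speakerDiarization.py:

```python
import numpy as np
import librosa

def sliding_windows(wav_path, sr=16000, win_frames=100, hop_frames=50):
    """Slice a magnitude spectrogram into overlapping windows and report
    each window's start time in seconds. Sizes here are illustrative."""
    y, _ = librosa.load(wav_path, sr=sr)
    hop_length = int(0.010 * sr)                  # 10 ms per STFT frame (assumed)
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=hop_length))

    windows, start_times = [], []
    for start in range(0, spec.shape[1] - win_frames + 1, hop_frames):
        windows.append(spec[:, start:start + win_frames])
        start_times.append(start * hop_length / sr)  # frame index -> seconds
    return windows, start_times
```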

MatthewWaller commented 4 years ago

Interesting. So looking at the speakerDiarization.py code more closely, would I need to import two models into the iOS or Android app?

Looks like I would load one model here:

```python
network_eval = spkModel.vggvox_resnet2d_icassp(input_dim=params['dim'],
                                               num_class=params['n_classes'],
                                               mode='eval', args=args)
network_eval.load_weights(args.resume, by_name=True)
```

So that I analyze the features with it and send its output to this model here:

```python
uisrnnModel = uisrnn.UISRNN(model_args)
uisrnnModel.load(SAVED_MODEL_NAME)
```

Am I reading that right?
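
That reading matches the flow in speakerDiarization.py: the VGG network turns each windowed spectrogram into a speaker embedding, and UIS-RNN clusters the embedding sequence into speaker labels. A condensed sketch of how the two stages connect, assuming `network_eval` and `uisrnnModel` are loaded as in the snippets above, `specs` holds the sliding-window spectrogram slices, and `inference_args` comes from the repo's argument parser (shapes simplified):

```python
import numpy as np

# Stage 1: one speaker embedding per spectrogram window (VGG model).
feats = []
for spec in specs:
    spec = spec[np.newaxis, ..., np.newaxis]    # add batch/channel dims (simplified)
    feats.append(network_eval.predict(spec))
feats = np.array(feats)[:, 0, :].astype(float)  # (num_windows, emb_dim)

# Stage 2: cluster the embedding sequence into speaker labels (UIS-RNN).
predicted_labels = uisrnnModel.predict(feats, inference_args)
# predicted_labels[i] is the speaker id of window i; combine with the
# sliding-window start times to recover who spoke when.
```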

alam-botify commented 3 years ago

Hi @MatthewWaller, is there any update on this? Are you running this on mobile devices?

I am also working on offline speaker diarization on a mobile device (Android).

My questions are:

  1. How do I convert the saved_model.uisrnn_benchmark model to another format, like TFLite or a PyTorch (.pt) file, that is compatible with Android? (One possible route is sketched below.)
  2. If you are running it on mobile devices, how well does it run (speed and accuracy)?
  3. @taylorlu Do you have any idea how it performs on mobile devices, and is it feasible to run it offline?

Any help is appreciated. Thanks!
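
On question 1, the two models would take different routes: the VGG embedding network is Keras, so the standard TFLite converter is a plausible path, while saved_model.uisrnn_benchmark is a PyTorch checkpoint, so TorchScript / PyTorch Mobile would be the analogous path (and the UIS-RNN decoding loop is plain Python, which may need reimplementing on device). A hedged sketch of the Keras half, assuming `network_eval` is built and loaded as in the snippets above:

```python
import tensorflow as tf

# Hypothetical conversion of the Keras embedding network to TFLite.
# Uses TF2's converter; TF1-era Keras models would instead go through
# tf.lite.TFLiteConverter.from_keras_model_file on a saved .h5 file.
# A dynamic spectrogram-length input may need fixing to a constant shape first.
converter = tf.lite.TFLiteConverter.from_keras_model(network_eval)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization/size pass
tflite_model = converter.convert()

with open("vggvox_embedding.tflite", "wb") as f:
    f.write(tflite_model)
```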