MatthewWaller opened this issue 4 years ago
So the question is how to handle the MFCC processing with the source PCM/WAV stored in memory.
There are several open-source frameworks written in C++ that can do this, such as Essentia and SoX, and since MFCC is such a common feature it also isn't hard to write yourself or find on GitHub.
Then you would use the AVFoundation framework or OpenAL on iOS to handle the real-time audio queue, and pull the data in another thread to do the diarization.
I don't think it's cumbersome work, but it will take time to debug.
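For reference, the in-memory MFCC step on the Python side could look something like the sketch below. This is only a minimal illustration using python_speech_features (which this repo does not itself depend on); the buffer, sample rate, and window parameters are all assumptions.

import numpy as np
from python_speech_features import mfcc

# Pretend this int16 PCM buffer came from the platform's audio queue
# (AVFoundation / OpenAL on iOS); here it is just one second of noise.
sample_rate = 16000
pcm_buffer = (np.random.randn(sample_rate) * 3000).astype(np.int16)

# Convert to float and compute 13-coefficient MFCCs over 25 ms windows
# with a 10 ms hop; these parameters are illustrative, not the repo's.
signal = pcm_buffer.astype(np.float32) / 32768.0
features = mfcc(signal, samplerate=sample_rate,
                winlen=0.025, winstep=0.01, numcep=13)

print(features.shape)  # (number_of_frames, 13)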
Ah! So you're saying that the inputs are basically MFCCs? I actually wrote an MFCC calculator in Swift in the style of python_speech_features. So should I be able to take, say, a 10 ms buffer of audio (or more), use that as input to the model, and get a prediction? I'm thinking of getting predictions while the audio is streaming (asynchronously, of course, but still).
Yes, have a look at https://github.com/taylorlu/Speaker-Diarization/blob/403783173239a33bfd4de0774921ba9479413641/speakerDiarization.py#L105. The code uses librosa to extract the audio features, and the sliding window lets you calculate the time points accurately.
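Roughly, the idea around that line is to compute a spectrogram with librosa and slide a fixed-length window over it, so every window maps back to a time range. The sketch below is a reconstruction, so treat the exact STFT and window sizes as assumptions rather than the repo's real values.

import numpy as np
import librosa

# Load audio at 16 kHz and compute a magnitude spectrogram.
# The FFT/window/hop sizes here are illustrative, not the repo's exact values.
y, sr = librosa.load('example.wav', sr=16000)
spec = np.abs(librosa.stft(y, n_fft=512, win_length=400, hop_length=160))

# Slide a fixed-length window across the spectrogram frames; each window
# becomes one input to the embedding network, and its start/end indices
# map back to seconds through the hop length.
win_frames, hop_frames = 200, 100   # ~2.0 s windows with ~1.0 s hop (assumed)
segments = []
for start in range(0, spec.shape[1] - win_frames + 1, hop_frames):
    window = spec[:, start:start + win_frames]
    t0 = start * 160 / sr
    t1 = (start + win_frames) * 160 / sr
    segments.append((t0, t1, window))

print(len(segments), 'windows')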
Interesting. So looking at the speakerDiarization.py code more closely, would I need to import two models into the iOS or Android app?
Looks like I would load one model here:
network_eval = spkModel.vggvox_resnet2d_icassp(input_dim=params['dim'],
                                               num_class=params['n_classes'],
                                               mode='eval', args=args)
network_eval.load_weights(args.resume, by_name=True)
So that I can analyze the features with it and send its output to this model here:
uisrnnModel = uisrnn.UISRNN(model_args)
uisrnnModel.load(SAVED_MODEL_NAME)
Am I reading that right?
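If so, I imagine the glue between the two would look roughly like the sketch below. The variable names are made up: segments is meant to be a list of (start, end, spectrogram window) tuples from a sliding window like the one you described, and the input shape fed to network_eval.predict is an assumption rather than the repo's exact layout.

import numpy as np

# Hypothetical glue: embed each spectrogram window with the Keras model,
# then hand the whole sequence of embeddings to the uisrnn model.
embeddings = []
for t0, t1, window in segments:                     # (start_s, end_s, spectrogram slice)
    inp = window[np.newaxis, :, :, np.newaxis]      # assumed shape: (1, freq, time, 1)
    embeddings.append(network_eval.predict(inp)[0])

feats = np.array(embeddings)

# inference_args would come from uisrnn.parse_arguments(), as in the demo script.
predicted_labels = uisrnnModel.predict(feats, inference_args)
# predicted_labels[i] should then be the speaker cluster for segments[i]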
Hi @MatthewWaller, is there any update on this? Are you running this on mobile devices?
I am also working on offline speaker diarization on a mobile device (Android).
My questions are:
Any help appreciated. Thanks
Hello!
Thank you @taylorlu for all your work here, first off.
I'm working to get a handle on speaker diarization, and wanted to know if you had an idea of what might be involved in getting this system to work on mobile.
Assuming I could get the actual PyTorch model successfully loaded on a mobile device, either by using PyTorch's SDK for that directly or by converting to Core ML on iOS or some such, what kind of audio preprocessing is needed when feeding each buffer of audio to the model?
If the question is too broad or far reaching, please let me know of any resources I might look at to gain some perspective, or any good spots in the code to examine what the preprocessing looks like for inference.
Thanks!