qacollective opened this issue 7 years ago
Hi @ppwwyyxx,
Sorry to bother you; it seems you've been busy lately. I noticed you told @blackunicorn47 that a similar feature to the one he described (#39) doesn't yet exist, but do you think my idea for implementing such a feature is feasible?
Have I possibly missed something? I'm certainly not an expert in this field.
Andrew
I don't see why the assumption "probabilities coming from the pre-trained GMM set will change significantly whenever a speaker change happens" can hold. My guess is that they won't change in a way you can easily detect.
Setting the threshold is another problem. The threshold you're looking for may even be different for each speaker.
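If you want to test that before building everything, here is a minimal sketch of the check I would run first (synthetic Gaussian features standing in for real MFCC frames, and scikit-learn's GaussianMixture rather than this project's code): measure how much per-block scores fluctuate within one speaker before trusting any fixed threshold.

```python
# Synthetic stand-in experiment: how noisy are per-block GMM scores?
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim = 13                                          # e.g. 13 MFCCs per frame
speaker_a = rng.normal(0.0, 1.0, (2000, dim))     # fake frames, speaker A
speaker_b = rng.normal(0.8, 1.2, (2000, dim))     # fake frames, speaker B

gmm_a = GaussianMixture(n_components=8, random_state=0).fit(speaker_a)

def block_scores(frames, gmm, block=100):
    """Mean log-likelihood of consecutive fixed-size blocks of frames."""
    return np.array([gmm.score(frames[i:i + block])
                     for i in range(0, len(frames) - block + 1, block)])

same = block_scores(speaker_a, gmm_a)    # scores on the speaker the GMM knows
other = block_scores(speaker_b, gmm_a)   # scores on a different speaker

# If the within-speaker spread overlaps the cross-speaker scores, no single
# global threshold can separate them -- it would have to be set per speaker.
print("same speaker:  mean %.2f, std %.2f" % (same.mean(), same.std()))
print("other speaker: mean %.2f, std %.2f" % (other.mean(), other.std()))
```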
Hi Yuxin,
Thanks for those points of caution/consideration. I might start by experimenting with simply diarizing an audio clip before I commit fully to the idea I've outlined above.
I'll get back to you with what I find.
Dynamic recognition is certainly possible; the question that remains is the time it takes, both the time to develop it and the latency between audio capture and speaker recognition.
In my case, the speaker identification I need is limited in the number of speakers, and among all possible speakers there are some who participate in the meetings most often. Of these, we have more than two hundred with audio recordings available. I need the system to be dynamic and to learn from its errors, which means I must have a learning model (e.g. a neural network) that keeps in the loop the person who will monitor the operation of the application in the production environment :-). The most promising models are hybrids, and generally combine the Gaussian Mixture Model with Support Vector Machines, Linear Prediction Cepstral Coefficients, Dynamic Time Warping, Hidden Markov Models, etc. (one example is sketched below).
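To illustrate one such hybrid (only a sketch under my own assumptions: the GMM-scores-into-SVM stage is my choice, and random vectors stand in for real MFCC/LPCC features), per-speaker GMM log-likelihoods can become the feature vector for an SVM that makes the final decision:

```python
# Hybrid sketch: per-speaker GMMs produce score vectors, an SVM classifies them.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(1)
dim, n_speakers = 13, 4
train = [rng.normal(float(s), 1.0, (1000, dim)) for s in range(n_speakers)]

# Stage 1: one GMM per enrolled speaker.
gmms = [GaussianMixture(n_components=8, random_state=0).fit(x) for x in train]

def score_vector(frames):
    """Log-likelihood of an utterance under every speaker's GMM."""
    return np.array([g.score(frames) for g in gmms])

# Stage 2: an SVM over 100-frame utterances' score vectors picks the speaker,
# learning per-speaker score biases instead of hand-set thresholds.
X = [score_vector(x[i:i + 100]) for x in train for i in range(0, 1000, 100)]
y = [s for s in range(n_speakers) for _ in range(10)]
clf = SVC(kernel="linear").fit(X, y)

test = rng.normal(2.0, 1.0, (100, dim))   # fake utterance from speaker 2
print("predicted speaker:", clf.predict([score_vector(test)])[0])
```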
I will keep you updated as the project progresses.
Hi team! Your project is great because it's fast (real-time!) and the GMMs seem quite flexible. For example, from my reading of the source code, it seems possible to run enrolment and prediction on a GMM as audio comes in piece by piece. So I hope those assumptions are correct to start with!
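To be concrete about the "piece by piece" part, here is a minimal sketch of what I have in mind, assuming scikit-learn's GaussianMixture as a stand-in; I'm not claiming this is how your code does it:

```python
# Incremental enrolment sketch: refit on accumulated frames as chunks arrive
# (warm_start=True reuses the previous fit as initialisation for the next fit,
# rather than restarting from scratch each time).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
gmm = GaussianMixture(n_components=8, warm_start=True, random_state=0)

frames = np.empty((0, 13))                   # accumulated enrolment frames
for n in range(5):                           # five incoming chunks of audio
    chunk = rng.normal(0.0, 1.0, (500, 13))  # stand-in for one chunk's MFCCs
    frames = np.vstack([frames, chunk])      # grow the enrolment data
    gmm.fit(frames)                          # refit, warm-started from last fit
    print("after chunk %d: score %.2f" % (n, gmm.score(chunk)))
```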
I have an idea that I'd like to get your opinion on whether you think it will be possible and practical.
I want to diarize speakers in real time, train a GMM to recognise each voice, and then use those GMMs to recognise the same voices in future.
My idea is:
1. Train ~30 GMMs (?!?) (15 male, 15 female) on selected, different voices; mark these as 'generic GMMs'.
2. Run audio through VAD, then through the GMM set, purely to detect speaker changes in, say, 5-second blocks of the audio stream. Train a new GMM on the current voice until a speaker change is detected, then add the newly trained GMM to the set (30+1), marked as a 'non-generic GMM'. For this step, you would need to spend time defining things like:
   a) a minimum-probability threshold for deciding it is the same speaker
   b) a minimum-speaking-time threshold for adding a new speaker
   c) ... probably more
   I'm guessing that the thresholds for (a)-(c) may also change depending on the number of speakers.
3. Repeat step 2 until the end of the audio stream.
4. Dump to disk all GMMs marked 'non-generic'. Now you have a set of GMMs which have a good chance of recognising the speakers in your audio file, and you can reuse them to recognise those speakers in future recordings (see the sketch after this list).
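To make steps 1-4 concrete, here is a minimal end-to-end sketch under my own assumptions: random vectors stand in for VAD'd MFCC frames, a change in which generic GMM scores highest is treated as the speaker-change signal, and the block size and threshold (b) are placeholder values that would need tuning (threshold (a) is omitted for brevity):

```python
# End-to-end sketch of the proposed pipeline on a fake two-speaker stream.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
DIM, BLOCK = 13, 250                 # frames per ~5 s block (assumption)
MIN_BLOCKS_TO_ENROL = 2              # threshold (b): minimum speaking time

def train_gmm(frames):
    return GaussianMixture(n_components=8, random_state=0).fit(frames)

# Step 1: 'generic' GMMs trained on selected voices (4 here, not 30).
generic = [train_gmm(rng.normal(mu, 1.0, (2000, DIM))) for mu in (-2, -1, 1, 2)]
non_generic = []

# Fake stream: one speaker for 4 blocks, then a different speaker for 4 blocks.
stream = np.vstack([rng.normal(0.9, 1.0, (4 * BLOCK, DIM)),
                    rng.normal(-2.1, 1.0, (4 * BLOCK, DIM))])

cur_blocks, prev_winner = [], None
for i in range(0, len(stream), BLOCK):
    block = stream[i:i + BLOCK]
    winner = int(np.argmax([g.score(block) for g in generic]))
    if prev_winner is not None and winner != prev_winner:
        # Step 2: the generic-GMM score profile shifted -> assume a change,
        # and enrol the buffered voice as a new 'non-generic' GMM.
        if len(cur_blocks) >= MIN_BLOCKS_TO_ENROL:
            non_generic.append(train_gmm(np.vstack(cur_blocks)))
            print("block %d: change detected, enrolled non-generic GMM #%d"
                  % (i // BLOCK, len(non_generic)))
        cur_blocks = []
    cur_blocks.append(block)
    prev_winner = winner

# Steps 3-4: at end of stream, enrol the last voice and keep all non-generic
# GMMs for recognising these speakers in later recordings.
if len(cur_blocks) >= MIN_BLOCKS_TO_ENROL:
    non_generic.append(train_gmm(np.vstack(cur_blocks)))
print("non-generic GMMs enrolled:", len(non_generic))
```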
So of course all of this rests on the assumption that the probabilities coming from the pre-trained GMM set will change significantly whenever a speaker change happens. But this seems like a reasonable assumption?
I'm struggling to find a flaw in my idea, and I welcome anyone to add some thinking before I start to code!
Andrew