mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Speaker attribution and voice classification #2169

Open div5yesh opened 5 years ago

div5yesh commented 5 years ago

Enhancement: Speaker attribution and voice classification

kdavis-mozilla commented 5 years ago

We can consider this, but generally we're narrowly focused on STT.

div5yesh commented 5 years ago

Transcription that also labels the speakers in the audio for each word or dialogue segment could have useful applications, and it would fit naturally as metadata for STT.

So maybe we could have a contrib repo as an extension to STT, hosting features related to speech in general. That would make DeepSpeech a powerful library catering to all kinds of speech-related problems.

What do you think?

kdavis-mozilla commented 5 years ago

@div5yesh A contrib repo is a reasonable idea. But we still have to consider the bandwidth required to review and test contrib code, and how contrib code gets updated across non-backward-compatible releases.

What's your take @reuben and @lissyx?

lissyx commented 5 years ago

I have a hard time figuring out exactly how those pieces would fit together. IMHO, my experience with this kind of contrib repo is mixed: they are often broken and badly maintained, provide a poor user/dev experience, and generate frustration. I understand the need for the feature, but it requires extending the API. How, in the end, would a contrib repo be integrated to provide that?

div5yesh commented 5 years ago

@lissyx Your concerns are valid. I think extending the API would be reasonable. Currently, STT covers a fairly limited set of use cases. Sure, adding more data to the output would definitely allow a few more use cases to be addressed. But I believe an API that does deep analysis of the speech itself to extract information (not just text) could be more useful.

lissyx commented 5 years ago

> I believe an API that does deep analysis of the speech itself to extract information (not just text) could be more useful.

We're not saying it's not useful :)

> I think extending the API would be reasonable.

Well, we expose a Metadata struct that you might be able to extend and experiment with, if you are interested.
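
As an illustration, here is a minimal sketch of how the per-token timing already exposed through Metadata could be joined with an external speaker timeline. It assumes the 0.9.x Python API; the model path and the diarization segments are hypothetical placeholders, not anything DeepSpeech produces itself.

```python
import wave
import numpy as np
import deepspeech

# Model file name is an assumption; use whatever release you have.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")

with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

metadata = model.sttWithMetadata(audio)

# Hypothetical diarization output: (start_sec, end_sec, speaker_label),
# e.g. produced by one of the tools mentioned later in this thread.
segments = [(0.0, 4.2, "spk0"), (4.2, 9.7, "spk1")]

def speaker_at(t):
    for start, end, spk in segments:
        if start <= t < end:
            return spk
    return "unknown"

# Each token carries its start time, so attribution is a simple lookup.
for token in metadata.transcripts[0].tokens:
    print(f"{token.text!r} @ {token.start_time:.2f}s -> {speaker_at(token.start_time)}")
```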

rhamnett commented 5 years ago

I'd also be interested in this

shashankpr commented 4 years ago

It is an interesting enhancement, and there is quite a lot of existing work related to it as well. One notable project (which I am using as a separate fork in my org) is Resemblyzer: https://github.com/resemble-ai/Resemblyzer

I would be happy to contribute to the speaker identification and classification feature after I familiarize myself with DeepSpeech's codebase.
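
For reference, a minimal sketch of what Resemblyzer provides: it embeds an utterance into a fixed-size speaker vector, and two utterances can be compared by cosine similarity. The file names and the 0.75 threshold are illustrative assumptions.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
embed_a = encoder.embed_utterance(preprocess_wav("speaker_a.wav"))
embed_b = encoder.embed_utterance(preprocess_wav("speaker_b.wav"))

# Utterance embeddings are L2-normalized, so a dot product is cosine similarity.
similarity = float(np.dot(embed_a, embed_b))
print("same speaker" if similarity > 0.75 else "different speakers", similarity)
```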

lissyx commented 4 years ago

FTR, I've come across https://github.com/ina-foss/inaSpeechSegmenter, which might be useful in this context.
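
A short sketch of how inaSpeechSegmenter is typically used: it labels regions of an audio file as speech (by gender), music, or noise, which could serve as a pre-pass before STT. The file name is a placeholder.

```python
from inaSpeechSegmenter import Segmenter

seg = Segmenter()
# Calling the segmenter returns (label, start_sec, stop_sec) tuples, with
# labels such as 'male', 'female', 'music', 'noise', 'noEnergy'.
for label, start, stop in seg("audio.wav"):
    print(f"{label}: {start:.1f}s - {stop:.1f}s")
```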

diego-fustes commented 4 years ago

I think that speaker diarization would be the most useful feature here, as it does not require any task-specific training data. The x-vector model is one of the most powerful implementations; see the Kaldi model.
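
To make the idea concrete, here is a generic sketch of the clustering stage of embedding-based diarization, not Kaldi's actual recipe: given speaker embeddings for fixed audio windows (extract_embedding is a hypothetical stand-in for any x-vector-style extractor), windows are grouped by speaker with agglomerative clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def extract_embedding(window: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for an x-vector (or similar) extractor."""
    raise NotImplementedError

def diarize(windows, n_speakers=2):
    embeddings = np.stack([extract_embedding(w) for w in windows])
    # scikit-learn >= 1.2 names this parameter `metric`; older releases
    # call it `affinity`.
    clustering = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(embeddings)
    return labels  # labels[i] is the speaker id assigned to window i
```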

Tortoise17 commented 4 years ago

Could you give an example of how to use it, with a wav file as input?

shravanshetty1 commented 3 years ago

This may or may not help - https://github.com/tyiannak/pyAudioAnalysis
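
For completeness, pyAudioAnalysis ships a speaker-diarization entry point along these lines; the exact function name and signature have varied across releases (older versions use camelCase speakerDiarization), so treat this as indicative rather than exact.

```python
from pyAudioAnalysis import audioSegmentation as aS

# One cluster label per analysis step; the speaker count (2) is an assumption.
labels = aS.speaker_diarization("audio.wav", 2)
print(labels)
```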