Kaljurand opened this issue 9 years ago
@Kaljurand Can you test the implementation?
Thanks for the quick implementation!
I got it working after changing the headers line to:
headers = {'Content-Type': 'audio/x-raw-int; rate=%s' % frame_rate}
I didn't test all the error condition handling though.
(Also, I had to comment out "import mad" in client/plugin.py because that dependency was not installed as part of the requirements.)
If you want to test it against an online server, you can use the URL http://bark.phon.ioc.ee/english/speech-api/v1/recognize. It's only meant as a demo (the recognition models are not very accurate, for example), so it should not be used as the default setting.
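For reference, a rough sketch of how a request to that endpoint could look from Python, assuming it accepts raw 16-bit PCM posted with the Content-Type header shown above and returns JSON with a list of hypotheses (the response shape is my assumption based on the kaldi-gstreamer-server docs, and may differ on this particular server):

# Sketch only: posts raw 16-bit PCM audio to the demo recognizer above.
# The "hypotheses"/"utterance" response fields are assumed from the
# kaldi-gstreamer-server docs.
import requests

def recognize(raw_pcm, frame_rate=16000,
              url='http://bark.phon.ioc.ee/english/speech-api/v1/recognize'):
    headers = {'Content-Type': 'audio/x-raw-int; rate=%s' % frame_rate}
    response = requests.post(url, data=raw_pcm, headers=headers)
    response.raise_for_status()
    hypotheses = response.json().get('hypotheses', [])
    return hypotheses[0].get('utterance', '') if hypotheses else ''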
Ideally the plugin should be based on the WebSocket interface. That would allow the audio to be streamed to the server while the user is still speaking, and the transcription to be processed as soon as it starts arriving, which would make the whole interaction snappier. I guess some of the other STT plugins would benefit from such a streaming mode as well. (But that's a separate issue.)
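To make this more concrete, here is a rough sketch of what streaming against the kaldi-gstreamer-server WebSocket interface could look like, using the websocket-client package. The /client/ws/speech path, the "EOS" end-of-stream marker and the JSON result shape follow that project's documentation; the host and port are placeholders, and this is not something the plugin implements yet:

# Sketch only: streams raw PCM chunks to a kaldi-gstreamer-server WebSocket
# endpoint and collects the final hypotheses as they arrive.
import json
import websocket  # pip install websocket-client

def stream_recognize(chunks, url='ws://localhost:8888/client/ws/speech'):
    # For raw PCM the server typically also needs the audio caps passed as a
    # content-type query parameter; omitted here to keep the sketch short.
    ws = websocket.create_connection(url)
    for chunk in chunks:  # iterable of raw PCM byte strings, e.g. from the mic
        ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
    ws.send('EOS')  # tell the server that the audio stream is finished
    transcript = ''
    while True:
        message = ws.recv()
        if not message:  # server closes the socket once it is done
            break
        result = json.loads(message)
        segment = result.get('result', {})
        if segment.get('final'):
            transcript += segment['hypotheses'][0]['transcript']
    ws.close()
    return transcript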
That'd need some major changes to the way recording and transcription are handled right now. It's a good idea, but it needs some thinking.
Add an STT engine based on https://github.com/alumae/kaldi-gstreamer-server, which offers an HTTP interface (very similar to GoogleSTT) and a WebSocket interface.