synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License

audio-source - for transcribe-stream? #12

Closed · farfade closed this issue 4 years ago

farfade commented 4 years ago

Hello @synesthesiam, and thanks for your amazing work!

I am trying to stream from MQTT to transcribe-stream, but I can't.

When I try to run transcribe-stream from stdin:

sox -t wav /tmp/test.wav -t wav - | /usr/bin/voice2json --debug transcribe-stream --audio-source -

I get this:

AttributeError: 'NoneType' object has no attribute 'stdout'

but I don't understand: when did I ever mention stdout?

Regards,

Romain

johanneskropf commented 4 years ago

transcribe-stream would expect something like this:

rec -L -e signed-integer -c 1 -r 16000 -b 16 -t raw - |

followed by your voice2json command. That uses sox's rec; you could also use arecord as follows:

arecord -q -r 16000 -c 1 -f S16_LE -t raw

which is also what voice2json itself uses to record when you call transcribe-stream, record-command or wait-wake without the --audio-source argument. All three of those commands expect a stream of raw audio chunks on stdin when you use --audio-source -, which is what both commands above produce. It is also important that the audio is recorded in the right format and encoding, and make sure you are using the 2.0.0 pre-release, since transcribe-stream was not added before that.

For transcribing a file sent over MQTT, why not use transcribe-wav instead? If you run it with the --stdin-files argument you can keep it permanently running and just pass the path of the file you want transcribed to the running process on stdin when it arrives over MQTT. This greatly speeds up transcription, especially with Docker, because the libraries and resources don't have to be loaded every time. Hope this gives you some ideas.
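For example, your original sox command should work once sox outputs raw samples instead of a WAV. A sketch, untested, assuming the default voice2json audio format of 16 kHz, 16-bit, mono:

sox -t wav /tmp/test.wav -t raw -r 16000 -c 1 -b 16 -e signed-integer - | /usr/bin/voice2json --debug transcribe-stream --audio-source -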

farfade commented 4 years ago

Thank you @johanneskropf for the explanations and ideas :)

As you understood, I'm definitely trying to speed up the whole thing while keeping it simple, so writing a full WAV file to disk and reading it back afterwards is not an option.

Since mosquitto_sub can listen continuously to a topic where a publisher streams raw WAV after detecting a hotword, piping it directly into voice2json would, to my mind, be the best approach.

My target is an MQTT, network-enabled service running something like:

mosquitto_sub (raw wav) | voice2json transcribe-stream | voice2json recognize-intent | mosquitto_pub (intent)

where mosquitto_sub would listen indefinitely to what is posted to hermes/audioServer//audioFrame by snips-satellite (for now, and later by whatever open-source component replaces it).
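Something along these lines, as a sketch only (the output topic is made up, and -N stops mosquitto_sub from appending a newline to each payload):

mosquitto_sub -t 'hermes/audioServer/+/audioFrame' -N | voice2json transcribe-stream --audio-source - | voice2json recognize-intent | mosquitto_pub -t 'voice2json/intent' -l

(Each audioFrame payload carries its own WAV header, though, so I suspect that part of the pipe is where it breaks down.)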

So I've got:

johanneskropf commented 4 years ago

Yes, it expects raw audio only. Internally it uses WebRTC VAD, I think, to determine when a voice command has finished, pretty much the same way record-command does: it records until it thinks the user has stopped speaking or until a timeout defined in your profile.yml is reached.

You will probably have to introduce some kind of intermediary into your workflow. You could use Node-RED: receive the data from the snips satellite there, remove the timestamp, then strip the WAV header and send the raw audio to transcribe-stream. But since the snips satellite already does its own VAD, I don't see this as very efficient.

I would still recommend what I said above about using transcribe-wav. You can keep it running with the --stdin-files argument and just pass it the reference to the file that arrived from snips. With Node-RED you can assemble a single buffer from the stream and write it to a temporary file really easily. And as I said, keeping transcribe-wav running negates a lot of the speed problems; I see faster transcription with Kaldi this way than I had with snips.
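To make that concrete, a minimal sketch of the long-running transcribe-wav pattern (paths are illustrative; the FIFO only exists to keep stdin open between files):

mkfifo /tmp/wav-paths
voice2json transcribe-wav --stdin-files < /tmp/wav-paths | voice2json recognize-intent &
exec 3>/tmp/wav-paths                # hold the FIFO open so transcribe-wav never sees EOF
echo /tmp/incoming/command.wav >&3   # transcribe a file by writing its path

Each time a WAV arrives from the satellite, write it to a temporary file and send its path down the FIFO; the already-loaded process transcribes it without any startup cost.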

farfade commented 4 years ago

You're right. With your detailed explanations I now understand why I definitely have to use transcribe-wav.

Thank you very much! Now I just have to find the spare time to do it :)