synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License

[help] transcribe-stream and transcribe-wav behaving different and giving different output #87

Closed tusharkeshav closed 1 year ago

tusharkeshav commented 1 year ago

Firstly, thank you so much for such an amazing tool, voice2json. I'm kinda new to voice2json. For me, transcribe-stream and transcribe-wav are behaving very differently.

On the same WAV file, they give different outputs.

I'm trying to set the brightness. In sentence.ini, I have the intent as `set brightness to {0..100}brightness`.

Let's say there is a WAV file, test.wav.

When using transcribe-wav, as `cat test.wav | voice2json transcribe-wav`, the output is:

{"text": "set brightness to blue", "likelihood": 1, "transcribe_seconds": 3.3602844100005314, "wav_seconds": 7.7821875, "tokens": null}

When using transcribe-stream, as `cat tusha.wav | voice2json transcribe-stream -a -`, the output is:

{"text": "set brightness to one hundred", "likelihood": 0, "transcribe_seconds": 1.314942486000291, "wav_seconds": 7.2, "tokens": [{"token": "set", "start_time": 0.0, "end_time": 1.74, "likelihood": 1.0}, {"token": "brightness", "start_time": 1.74, "end_time": 2.28, "likelihood": 1.0}, {"token": "to", "start_time": 2.28, "end_time": 2.31, "likelihood": 1.0}, {"token": "one", "start_time": 2.31, "end_time": 2.31, "likelihood": 0.453368}, {"token": "hundred", "start_time": 2.31, "end_time": 7.2, "likelihood": 0.453368}], "timeout": false}

Now I'm kinda stuck on this: transcribe-stream works fine and gives the correct result, but transcribe-wav gives worse results. As far as I know, we are supposed to use transcribe-wav. Please suggest what we can do here.

Further, voice2json record-command is not working as expected; it's not recording anything. So I had to use a custom one:

arecord -q -r 16000 -c 1 -f S16_LE -t wav > abc.wav
sox /tmp/abc.wav -r 16000 -e signed-integer -c 1 -t wav - pad 0 1 > abc_pad.wav
voice2json transcribe-wav < abc_pad.wav
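For anyone without sox, the padding step (`pad 0 1`, one second of trailing silence) can be approximated with Python's standard `wave` module. This is only a sketch: the function name `pad_wav_with_silence` is made up for illustration, and it assumes signed PCM audio such as the S16_LE format recorded above, where zero bytes mean silence.

```python
import wave


def pad_wav_with_silence(src_path, dst_path, seconds=1.0):
    """Copy a PCM WAV file and append `seconds` of silence at the end."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # Assumption: signed PCM (e.g. S16_LE), so silence is all-zero bytes.
    silence_frames = int(params.framerate * seconds)
    silence = b"\x00" * (silence_frames * params.sampwidth * params.nchannels)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(frames + silence)
```

Calling `pad_wav_with_silence("abc.wav", "abc_pad.wav")` would then stand in for the sox line, without any resampling.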

I'd really appreciate your help :)

synesthesiam commented 1 year ago

The stream/wav commands operate in different modes with the speech to text system, so they can give different outputs. Streaming is usually less accurate, but faster because it doesn't need to wait until the end to start processing.
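To see how the two modes differ on a given recording, a small script along these lines can run both commands on the same file and print the results side by side. It is a rough sketch, not part of voice2json: the helper names `summarize` and `transcribe` are made up, the command invocations are the ones quoted earlier in this thread, and it assumes each command prints one JSON object per line and exits once stdin is exhausted.

```python
import json
import subprocess
import sys


def summarize(result):
    """Return (text, mean token likelihood or None) for one transcription result."""
    tokens = result.get("tokens") or []  # transcribe-wav may report "tokens": null
    likelihoods = [token["likelihood"] for token in tokens]
    mean = sum(likelihoods) / len(likelihoods) if likelihoods else None
    return result["text"], mean


def transcribe(wav_bytes, args):
    """Pipe WAV bytes into a voice2json command and parse its first JSON line."""
    proc = subprocess.run(
        ["voice2json", *args],
        input=wav_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    first_line = next(
        line for line in proc.stdout.decode().splitlines() if line.strip()
    )
    return json.loads(first_line)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as wav_file:
        wav = wav_file.read()

    commands = {
        "transcribe-wav": ["transcribe-wav"],
        "transcribe-stream": ["transcribe-stream", "-a", "-"],
    }
    for name, args in commands.items():
        text, mean = summarize(transcribe(wav, args))
        print(f"{name}: {text!r} (mean token likelihood: {mean})")
```

Saved under a name of your choosing and run with a WAV path as its only argument, it makes cases like "blue" versus "one hundred" easy to spot, along with how confident the recognizer was about each token.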

It's still early days, but you may be interested in checking out Rhasspy 3: https://github.com/rhasspy/rhasspy3/ This will be the successor to voice2json.

tusharkeshav commented 1 year ago

Thanks Michael for your response! I was successful in integrating code with transcribe-stream. And, it's working perfectly fine.