synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License
1.08k stars 63 forks

transcribe_wav - json with timings - subtitles #43

Closed marcello-pietrobon closed 3 years ago

marcello-pietrobon commented 3 years ago

Hello, and first of all a big thank you for your amazing project, which is of the 'you saved my life' kind :)

I'm trying to adapt your code for the transcribe_wav command, using the Kaldi acoustic model type, in order to extract a subtitle file. I tend to believe Kaldi is the fastest choice, and the one with the lowest WER for this job, when the PC has no GPU card. Do you agree?

Judging from the output I get from transcribe_wav, it seems the only thing I need is a JSON output with the timing (start, end) of each spoken word in the transcribed audio file. Best of all would be to have an aligner, and in fact I tried to adapt the code from the gentle project: https://github.com/lowerquality/gentle/blob/master/align.py

Maybe this feature, or something close to it, is already available by changing some options in the kaldi_cmd used in _transcribe_wav_nnet3(); I just don't know. I've tried to reuse part of gentle's code, but I see that I would probably need to adapt some C++ code (like the gentle\ext\m3.cc application code that gentle uses for this job), and of course I don't want to go that far before asking.

Any suggestions on what I should do, or on a workaround?

Thank you, Marcello

synesthesiam commented 3 years ago

Hi @marcello-pietrobon, thank you for the kind words. I'm glad that voice2json has been able to help you. :)

I would agree that Kaldi is the best choice for now. I'm in the process of upgrading my DeepSpeech code to 0.9.3, so that may make the choice more complicated in the future (in a good way).

Have you tried the transcribe-stream command? If you run it like this:

$ voice2json transcribe-stream --event-sink /dev/stdout

You'll see the timing messages that tell you when the voice command has started and stopped (in seconds since the audio began). That plus the following transcript should hopefully be what you're looking for.
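Once you have start and stop times in seconds plus a transcript, turning them into a subtitle file is mostly a formatting job. Here is a minimal Python sketch that writes SRT entries. The `(start, stop, text)` tuple shape and the function names are assumptions for illustration only; they are not the actual format of the transcribe-stream events, so you would need to map the real messages onto them yourself.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, stop_seconds, text) tuples.

    The tuple layout is hypothetical; it is not a voice2json data structure.
    """
    blocks = []
    for index, (start, stop, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(stop)}\n{text}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)
```

Something like `to_srt([(0.5, 2.0, "turn on the light")])` would then give you one numbered entry per voice command.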

Also, don't forget that transcribe-stream takes raw audio instead of WAV, so you'll need to do something like:

$ sox input.wav -r 16000 -b 16 -c 1 -e signed-integer -t raw - | voice2json transcribe-stream --audio-source - ...
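If you would rather stay in Python than shell out to sox, the standard-library `wave` module can strip the WAV header for you. This is only a sketch under one big assumption: the file is already 16 kHz, 16-bit, mono, because unlike the sox command above it does no resampling or channel conversion. The `wav_to_raw` name is made up for this example.

```python
import wave


def wav_to_raw(wav_path: str, out, chunk_frames: int = 4096) -> int:
    """Copy the PCM frames of a WAV file to a binary stream.

    Returns the number of bytes written. Assumes the input is already
    16 kHz, 16-bit, mono; anything else is rejected rather than converted.
    """
    written = 0
    with wave.open(wav_path, "rb") as wav:
        fmt = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
        if fmt != (16000, 2, 1):
            raise ValueError("expected 16 kHz, 16-bit, mono WAV; convert it first")
        while True:
            frames = wav.readframes(chunk_frames)
            if not frames:
                break
            out.write(frames)
            written += len(frames)
    return written
```

A small script that calls `wav_to_raw("input.wav", sys.stdout.buffer)` could then be piped into transcribe-stream with `--audio-source -` in place of the sox step.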