msub2 / sepia-speechrecognition-polyfill

A polyfill for SpeechRecognition built to function with a SEPIA STT server.
MIT License

Implementing grammars #1

Open msub2 opened 1 year ago

msub2 commented 1 year ago

So, while researching how other browsers implement SpeechRecognition, I learned a bit about speech grammars and how they're implemented in the API across browsers. Specifically, that they aren't. There's an open, 5-year-old bug on the Chromium bugtracker about it, with basically no communication from anyone at Google. This adds to the black-box nature of the API's implementation: the audio is passed to Google's servers for processing, but it seems they simply haven't chosen to support grammars.

But if grammars are basically useless in Chrome, why are there examples on MDN calling SpeechGrammarList.addFromString() with a JSGF string? I wondered about this for a while, but after looking through Mozilla's bugtracker I think I've found my answer: Firefox's initial implementation of the Web Speech API used PocketSphinx, which, you guessed it, uses JSGF for its grammars.
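For reference, the MDN-style usage looks roughly like the sketch below. `buildJsgf` is a hypothetical helper (it is not part of the polyfill or the Web Speech API) that only assembles a JSGF string; the commented-out lines show how that string would be handed to `SpeechGrammarList.addFromString()` in a browser that exposes the interface.

```javascript
// Hypothetical helper: assemble a minimal JSGF grammar with a single public
// rule whose body is a flat list of alternatives.
function buildJsgf(grammarName, ruleName, alternatives) {
  const body = alternatives.join(' | ');
  return `#JSGF V1.0; grammar ${grammarName}; public <${ruleName}> = ${body} ;`;
}

const colorsGrammar = buildJsgf('colors', 'color', ['aqua', 'azure', 'beige']);
console.log(colorsGrammar);

// In a browser exposing the (possibly prefixed) interface:
// const GrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
// const list = new GrammarList();
// list.addFromString(colorsGrammar, 1); // second argument is the weight
// recognition.grammars = list;
```

Whether the engine behind `recognition` actually honors the list is exactly the open question here.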

Which brings me to my main question: should this polyfill implement grammars? Speech recognition in general is still a fairly new field for me, so I don't really have a horse in this race either way, but there seem to be arguments both for and against it on the Web Speech repo itself, as well as discussion of grammar usage in general on the Vosk and PocketSphinx repos. If the polyfill were to support grammars, I think it would make sense to support SRGS too, as its ABNF form is derived from JSGF.

But what do you think? Hopefully we can have a productive discussion and figure out what would ultimately be best. It might even be good data for those still managing the spec.

fquirin commented 1 year ago

It is possible to build language models out of JSGF grammar files. Way back in the days of ILA I generated language models from grammars on the fly (in a few milliseconds). It would certainly be interesting to explore this topic again for the SEPIA STT server, since I was planning to add JSGF support in general to the SEPIA-Home server. So far you can use word boosting (Coqui) and a phrase list (Vosk) to get something similar, but neither is really a grammar.
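As a rough illustration of the phrase-list route, a very simple JSGF grammar could be flattened on the client before being sent along. This is only a sketch under strong assumptions: `jsgfToPhraseList` is a hypothetical helper, it handles just one public rule with flat alternatives (no optional groups, weights, or rule references), and how the resulting list would be passed to the SEPIA STT server is not shown here.

```javascript
// Hypothetical helper: expand a JSGF grammar that contains a single flat
// public rule ("public <name> = a | b | c ;") into a plain phrase list.
function jsgfToPhraseList(jsgf) {
  // Capture everything between "=" and the terminating ";" of the public rule.
  const match = jsgf.match(/public\s+<[^>]+>\s*=\s*([^;]+);/);
  if (!match) return [];
  return match[1]
    .split('|')
    .map((phrase) => phrase.trim().toLowerCase())
    .filter((phrase) => phrase.length > 0);
}

const phrases = jsgfToPhraseList(
  '#JSGF V1.0; grammar lights; public <command> = lights on | lights off | dim the lights ;'
);
console.log(phrases);
```

A real grammar would need a proper parser; this only shows why a phrase list is "something similar" but weaker than a grammar, since all structure is lost in the flattening.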