syl22-00 / pocketsphinx.js

Speech recognition in JavaScript and WebAssembly

Size of dictionary file #108

Closed 6gsaifulislam closed 7 years ago

6gsaifulislam commented 7 years ago

Just a general question: should the dictionary file contain the complete list of words used in a language (i.e. the dict file supplied by CMU), or should it be limited to the words that the pocketsphinx process is expected to recognise?

For example, a game only needs a few words (up, down, left, right, go, stop). But if the dictionary is limited to those words, would that create a lot of false positives?

syl22-00 commented 7 years ago

The words that might be recognized depend on the language model (grammar or statistical language model), not the pronunciation dictionary. But all words in the language model must be in the dictionary. You can add words to the dictionary at init time through a dictionary file or at runtime via the addWords function.
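The rule that every language-model word needs a dictionary entry can be checked before initialising the recognizer. The following is a minimal sketch, not part of pocketsphinx.js: the helper names `parseDictionaryWords` and `findMissingWords` are hypothetical, and it assumes the usual CMU dictionary layout of one `word phone phone ...` entry per line, with alternate pronunciations written as `word(2)`.

```javascript
// Collect the set of words defined in a CMU-style pronunciation dictionary.
// Assumed layout: "word PH1 PH2 ...", alternates written as "word(2) PH1 ...".
function parseDictionaryWords(dictText) {
  const words = new Set();
  for (const line of dictText.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    const entry = trimmed.split(/\s+/)[0];
    // Strip the "(2)", "(3)" suffix that marks an alternate pronunciation.
    words.add(entry.replace(/\(\d+\)$/, ""));
  }
  return words;
}

// Return the language-model words that have no dictionary entry.
// Matching is case-sensitive, so both inputs must use the same casing.
function findMissingWords(modelWords, dictText) {
  const known = parseDictionaryWords(dictText);
  return modelWords.filter((word) => !known.has(word));
}

// Hypothetical game vocabulary and a deliberately incomplete dictionary.
const gameWords = ["up", "down", "left", "right", "go", "stop"];
const sampleDict = [
  "up AH P",
  "down D AW N",
  "left L EH F T",
  "right R AY T",
  "go G OW",
].join("\n");

// Here only "stop" lacks an entry.
console.log(findMissingWords(gameWords, sampleDict));
```

Any words reported as missing would need to be added, either to the dictionary file or at runtime through the addWords function mentioned above.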

6gsaifulislam commented 7 years ago

@syl22-00

Thanks Sylvain;

"But all words in the language model must be in the dictionary"

So, if I understand you correctly: if I am using the cmusphinx-fr-ptm-5.2 data, I should use the fr.dict that came with that download as is, rather than reducing its size by removing words I do not need?

syl22-00 commented 7 years ago

Yes, if you use a statistical language model, you should use the associated dictionary. Those can be large, and I am not sure how well a browser handles them. If it is too large, I'd try compiling to WebAssembly.
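To see which words a statistical model actually expects the dictionary to cover, one can list its unigram vocabulary. This is a rough sketch that assumes the model is available as ARPA-format text, where the `\1-grams:` section holds one `logprob word [backoff]` entry per line. The function name `arpaVocabulary` is hypothetical, the sample probabilities are made up, and skipping tokens in angle brackets such as `<s>` and `</s>` is an assumption about how sentence markers are written.

```javascript
// List the words found in the "\1-grams:" section of an ARPA language model.
function arpaVocabulary(arpaText) {
  const vocab = [];
  let inUnigrams = false;
  for (const line of arpaText.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "\\1-grams:") {
      inUnigrams = true;
      continue;
    }
    // The next section header ("\2-grams:" or "\end\") ends the unigrams.
    if (inUnigrams && trimmed.startsWith("\\")) break;
    if (!inUnigrams || trimmed === "") continue;
    // Unigram entry: "<log10 probability> <word> [<backoff weight>]".
    const fields = trimmed.split(/\s+/);
    if (fields.length < 2) continue;
    const word = fields[1];
    // Assumption: tokens like <s>, </s>, <UNK> are markers, not real words.
    if (/^<.*>$/.test(word)) continue;
    vocab.push(word);
  }
  return vocab;
}

// Tiny illustrative model; the probabilities are invented for the example.
const sampleArpa = [
  "\\data\\",
  "ngram 1=4",
  "",
  "\\1-grams:",
  "-0.7782 </s> -0.3010",
  "-0.7782 <s> -0.3010",
  "-0.7782 bonjour -0.3010",
  "-0.7782 merci -0.3010",
  "",
  "\\end\\",
].join("\n");

console.log(arpaVocabulary(sampleArpa));
```

For a full-size model this list is typically long, which is why the dictionary shipped with the model tends to be large as well: removing entries by hand risks dropping words the model still refers to.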

6gsaifulislam commented 7 years ago

@syl22-00 Many thanks, I will use the downloaded dictionary. It is only 3 MB, so not too large. That might also explain why I was getting a lot of false positives when using a much reduced dictionary file.