using automatic transcription

jdittrich commented 5 years ago

Installation and version

Parlatype 1.6 from dl.flathub.org

Your desktop environment

Ubuntu 18.04 with Gnome

Issue

I tried to set up speech recognition. I tried several setups; none has shown any effect. Currently I am trying with:

Language model: 70k 0.2
Dict: cmudict 0.7
Acoustic model: en us 5.2

(but I also tried with others, namely the pruned and ptm files)

What I do: I load an mp3 (or wav) file, set the marker to the beginning, set transcription to automatic and play.

Result: It does not put out any text in the textpad, as far as I can see (I can type in there, though)

Note: Possibly my setup is not correct. In this case it could help to suggest combinations of files to use in the settings.

gkarsay commented 5 years ago

Thanks for your report! I wasn't aware that the English download section doesn't provide a full setup. I would suggest using the model that comes with Pocketsphinx, the library that does the speech recognition. It has only one (English) model, which I decided not to ship with Parlatype because of its size. I also guess that not everybody wants an English model, and some users may not want this feature at all, for any language.

Pocketsphinx is on GitHub; you can download the whole project as a zip file: https://github.com/cmusphinx/pocketsphinx/archive/master.zip

Extract the directory model/en_us/ and save it somewhere in your home directory. (The Flathub version of Parlatype has read-only permissions, and only for your home directory.) Choose this directory on the first page of the assistant. On the second page, choose the language model without "phone" in the name. Confirm, and this should work.
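For anyone who would rather script the extraction step, here is a minimal Python sketch. It assumes the zip has already been downloaded and that, as is usual for GitHub archives, everything sits inside one top-level folder. The function names, the destination path, and the ".lm" file-name test are illustrative assumptions, not part of Parlatype or Pocketsphinx; the "model/en_us/" path is taken from the description above and is passed in as a parameter so it can be adjusted to whatever the archive actually contains.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_subdir(zip_path, subdir, dest):
    """Copy the contents of `subdir` from the archive into `dest`.

    `subdir` is given relative to the archive's single top-level folder
    (an assumption about how the zip is laid out).
    """
    dest = Path(dest).resolve()
    wanted = PurePosixPath(subdir).parts
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Drop the top-level folder that wraps the whole project.
            parts = PurePosixPath(info.filename).parts[1:]
            if len(parts) <= len(wanted) or parts[:len(wanted)] != wanted:
                continue
            target = dest.joinpath(*parts[len(wanted):])
            # Skip entries that would end up outside `dest`.
            if dest not in target.resolve().parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted


def language_models_without_phone(model_dir):
    """List candidate language model files whose names lack "phone".

    Treating any file with ".lm" in its name as a language model is an
    assumption made for this sketch.
    """
    return sorted(
        p.name
        for p in Path(model_dir).iterdir()
        if p.is_file() and ".lm" in p.name and "phone" not in p.name
    )
```

Calling `extract_subdir("master.zip", "model/en_us/", Path.home() / "parlatype-models" / "en_us")` would place the model files under the home directory, where the Flathub build can read them; `language_models_without_phone(...)` on that folder then lists the files to choose from on the second page of the assistant.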

In the CMU Sphinx download section the German models are complete; you can download, for example, cmusphinx-de-voxforge-5.2.tar.gz, and it's a full setup.

I have to admit finding a suitable model isn't always easy and then the results are not always usable. First of all, you need a good-quality recording. I'm thinking of marking this feature as experimental as it's also below my own expectations. A more serious approach would have to include some adapting/training to improve accuracy, but that's out of scope at the moment.

jdittrich commented 5 years ago

> I have to admit finding a suitable model isn't always easy and then the results are not always usable.

I really liked that I could try this out, but yes, the results were not great.

> I'm thinking of marking this feature as experimental as it's also below my own expectations

Probably makes sense. It is a fun feature to play with, but for non-hacker-ish purposes it is probably more of a distraction, currently.

> A more serious approach…

I have high hopes for Mozilla’s Common Voice / DeepSpeech projects, but so far there are no easy-to-integrate results.
