srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

separate the model loading and decoder #117

Closed madhavsund closed 7 years ago

madhavsund commented 7 years ago

I would like to develop a web interface for live testing of the acoustic model (record audio and send the wave file to the server). For this I need to load the model once and decode whenever a request comes in. My question is whether it is possible to separate model loading from decoding in EESEN.

riebling commented 7 years ago

I'm not sure I understand what is meant by loading the model. Decoding uses a model file on disk; the path to the model is a parameter to the decoding scripts. The model path is specified as a Makefile variable, MODEL_DIR, located in the file Makefile.options. So if you initiate decoding by running the Makefile in a way similar to how the control script speech2text.sh does, you could change MODEL_DIR each time to try different models.
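As a rough illustration of changing MODEL_DIR per run, the sketch below only builds a `make` command line that sets the variable on the command line (GNU make lets command-line assignments take precedence over ordinary assignments in the makefiles). The target name `decode-target` and the model paths are hypothetical placeholders, not names taken from the Eesen transcriber; check the actual Makefile for the real targets.

```python
from typing import List


def make_command(model_dir: str, target: str, makefile: str = "Makefile") -> List[str]:
    """Build a make invocation that overrides MODEL_DIR for a single run.

    `target` is whatever target the real Makefile defines for decoding;
    the value used below is a hypothetical placeholder.
    """
    return ["make", "-f", makefile, f"MODEL_DIR={model_dir}", target]


# Hypothetical model directories and target name, for illustration only.
for model_dir in ["/path/to/model_a", "/path/to/model_b"]:
    cmd = make_command(model_dir, "decode-target")
    print(" ".join(cmd))
    # To actually run it, one could pass `cmd` to subprocess.run(cmd, check=True).
```

Note that each such run still starts a fresh process and reads the model from disk again, so this switches models but does not keep one resident in memory.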

madhavsund commented 7 years ago

The aim is to increase recognition speed. Here the model is fixed, but the input keeps changing. So, instead of loading the model each time, is there any way to load the model once and decode as input arrives?

riebling commented 7 years ago

Ok that makes more sense. It almost sounds like a description of what you would do with a real-time continuous transcriber: initialize the system with everything it needs (models), start it running, then feed incremental chunks of audio as input, and have output for each chunk. Probably the chunk size would be at the level of utterance / sentence. We plan to make a version of Eesen (transcriber) that can do continuous, real-time decoding, but it is not yet available.
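The general "initialize once, then feed utterances" pattern described above can be sketched in a few lines. This is not Eesen code: `fake_load_model` and `fake_decode` are stand-ins invented for the example, and a real service would replace them with whatever loading and decoding calls the toolkit eventually exposes.

```python
from typing import Callable, Iterable, Iterator


class ResidentDecoder:
    """Holds a model in memory so that each request only pays for decoding."""

    def __init__(self,
                 load_model: Callable[[], object],
                 decode: Callable[[object, bytes], str]) -> None:
        self._decode = decode
        self._model = load_model()  # expensive step, done once at startup

    def transcribe(self, audio: bytes) -> str:
        # Reuses the already loaded model for every utterance.
        return self._decode(self._model, audio)


def serve(decoder: ResidentDecoder, utterances: Iterable[bytes]) -> Iterator[str]:
    """Yield one result per incoming utterance-sized chunk."""
    for audio in utterances:
        yield decoder.transcribe(audio)


# Stand-ins (assumptions for illustration, not part of Eesen).
load_calls = 0


def fake_load_model() -> dict:
    global load_calls
    load_calls += 1
    return {"name": "placeholder-model"}


def fake_decode(model: dict, audio: bytes) -> str:
    return f"{model['name']}: {len(audio)} bytes"


decoder = ResidentDecoder(fake_load_model, fake_decode)
results = list(serve(decoder, [b"abc", b"defgh"]))
print(results, "model loads:", load_calls)
```

In a web setting, the `ResidentDecoder` would be created when the server process starts, and each request handler would call `transcribe` on the shared instance.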

As far as speed goes, using a 30ms frame size and a smaller decoding graph (built from a smaller language model), transcription is faster than real time, so for example a minute of audio would take less than a minute to transcribe. I forget the exact timing, but I think it takes about 1/3 of the duration of the audio. I say that because I was working on a system that combines 3 different transcription alternatives, and its transcription time was still comparable to the duration of the audio input.
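The arithmetic behind that estimate can be written out as a real-time factor (processing time divided by audio duration). The sketch assumes the three alternatives run one after another and each costs roughly the same; the 1/3 figure is the approximate recollection from the comment, not a measurement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


audio_seconds = 60.0       # one minute of audio
single_pass_rtf = 1 / 3    # rough figure recalled in the comment above
passes = 3                 # three transcription alternatives, assumed sequential

total_processing = passes * single_pass_rtf * audio_seconds
combined_rtf = real_time_factor(total_processing, audio_seconds)
print(f"single pass: {single_pass_rtf * audio_seconds:.0f}s, combined RTF: {combined_rtf:.2f}")
```

Under these assumptions, three passes at about 1/3 real time each add up to roughly the duration of the audio, which matches the observation that the combined system was still comparable to real time.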

Stay tuned, and we will keep these comments in mind when we release a real-time version of Eesen transcriber.

madhavsund commented 7 years ago

Ok