gooofy / zamia-speech

Open tools and data for cloudless automatic speech recognition
GNU Lesser General Public License v3.0

How to create 'data/dst/speech/%s/ai-sentences.txt'? #3

Closed by mpuels 6 years ago

mpuels commented 6 years ago

Hi Guenter,

I'm trying to get your scripts running and have a question regarding the training of the German language model. I've run

speech$ ./speech_sentences_de.py --train-punkt

which writes speech/data/dst/speech/de/punkt.pickle. And

speech$ ./speech_sentences_de.py

writes speech/data/dst/speech/de/sentences.txt. So far so good. Now I'd like to run speech_build_lm.py, but according to the lines

SOURCES = ['data/dst/speech/%s/sentences.txt',
           'data/dst/speech/%s/ai-sentences.txt']

it also needs the file 'data/dst/speech/%s/ai-sentences.txt'. The command

speech$ grep -rF ai-sentences.txt .

yielded

speech_build_lm.py:           'data/dst/speech/%s/ai-sentences.txt']

So the question is: what does ai-sentences.txt contain, and how do I create it? It isn't strictly necessary for training the language model, since we already have sentences.txt, but I'd like to know where ai-sentences.txt comes from :smile:

Thanks for your help in advance!

Cheers, Marc

gooofy commented 6 years ago

Hi Marc,

ai-sentences.txt is the bridge between my zamia-ai and speech projects. The idea is to dump all sentences accepted by zamia-ai's model, so that they are covered by the language models for sphinx/kaldi. And yes, this is completely optional: you can have an empty ai-sentences.txt, or fill it with any other sentences you want to make sure are covered.
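For illustration, here is a minimal sketch of creating that optional file by hand. The SOURCES templates are the ones quoted from speech_build_lm.py above. Everything else is an assumption and not part of the repo: the helper name ensure_ai_sentences, its base_dir parameter, the one-sentence-per-line layout, and the UTF-8 encoding. Any extra sentences should probably be normalized the same way as the lines in sentences.txt.

```python
import os

# Path templates as quoted from speech_build_lm.py in this thread.
SOURCES = ['data/dst/speech/%s/sentences.txt',
           'data/dst/speech/%s/ai-sentences.txt']


def ensure_ai_sentences(lang, extra_sentences=(), base_dir='.'):
    """Hypothetical helper (not part of zamia-speech).

    Creates ai-sentences.txt for `lang` if it does not exist yet and
    returns its path. With no extra sentences the file is left empty.
    An existing file is never overwritten.
    """
    path = os.path.join(base_dir, SOURCES[1] % lang)
    if os.path.exists(path):
        return path

    os.makedirs(os.path.dirname(path), exist_ok=True)

    # Assumption: one sentence per line, UTF-8 encoded.
    with open(path, 'w', encoding='utf-8') as f:
        for sentence in extra_sentences:
            f.write(sentence.strip() + '\n')
    return path


if __name__ == '__main__':
    # Run from the speech/ directory: creates an empty German file.
    print(ensure_ai_sentences('de'))
```

Passing a list such as ['computer schalte das licht ein'] as extra_sentences would write those lines instead of leaving the file empty.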

Cheers,

Guenter