kmario23 / KenLM-training

Training an n-gram based Language Model using KenLM toolkit for Deep Speech 2

BrokenPipeError: [Errno 32] Broken pipe #1

Open YasineNifa opened 5 years ago

YasineNifa commented 5 years ago

Hello, I am following this tutorial to create my French language model: https://github.com/kmario23/KenLM-training But when I run this command:
bzcat ./data_final/vocabulary.txt.bz2 | python preprocess.py | /home/innovation/kenlm/bin/lmplz -o 3 > myvocabulary.arpa

I get the following error:

print(' '.join(nltk.word_tokenize(sentence)).lower())
BrokenPipeError: [Errno 32] Broken pipe
Segmentation fault (core dumped)

kmario23 commented 5 years ago

Hi @YasineNifa, I haven't encountered such issues with English text. Have you followed the guide exactly? I'd suggest paying particular attention to creating a virtual environment. Maybe this discussion will also be helpful: ioerror-errno-32-broken-pipe-python

Please note that the file bible.en.txt.bz2 should be raw text with a single sentence per line. I see that you're using a vocabulary file instead.
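A note on the traceback above: Python raises `BrokenPipeError` when it writes to a pipe whose reader has already exited, so the error in the Python stage is probably a symptom, and the segmentation fault line suggests the downstream program is what actually crashed. The sketch below shows one way a preprocessing script could stop cleanly in that situation. It is an illustration under assumptions, not the repository's preprocess.py: `tokenize` is a hypothetical regex stand-in for `nltk.word_tokenize`, and `preprocess` and `main` are names invented here.

```python
import os
import re
import sys


def tokenize(sentence):
    # Hypothetical stand-in for nltk.word_tokenize: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(lines, out):
    """Write each non-empty line to `out` as lowercased, space-separated tokens.

    Returns False if the process reading from `out` has gone away.
    """
    try:
        for line in lines:
            tokens = tokenize(line.strip())
            if tokens:
                out.write(" ".join(tokens).lower() + "\n")
        out.flush()
    except BrokenPipeError:
        # The reader (for example lmplz) exited early; its own error
        # message is the one that explains the failure.
        return False
    return True


def main():
    if not preprocess(sys.stdin, sys.stdout):
        # Redirect stdout to devnull so the flush at interpreter exit
        # does not raise a second BrokenPipeError.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)
```

Calling `main()` at the bottom of a script turns this into a drop-in pipeline stage. Catching the exception only makes the Python side quieter; it does not fix whatever made the downstream process exit.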

YasineNifa commented 5 years ago

Yeah, I followed the guide, but I did not execute this command: bzcat vocabulary.txt.bz2 | python process.py | wc because I did not find the process.py file. And yes, the vocabulary file has the same structure as the bible file [raw text with a single sentence per line].
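For readers who want a check similar to the `| wc` step without running the preprocessing script, here is a minimal sketch that counts lines, empty lines, and whitespace-separated tokens in a bz2-compressed text file. The function name `corpus_stats` and the UTF-8 default are assumptions made for illustration. If the token count is close to the line count, the file is probably a word list rather than one sentence per line.

```python
import bz2


def corpus_stats(path, encoding="utf-8"):
    """Count lines, empty lines, and whitespace-separated tokens in a .bz2 text file."""
    lines = empty = tokens = 0
    # Text mode decodes strictly, so a wrong encoding raises UnicodeDecodeError.
    with bz2.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            lines += 1
            words = line.split()
            if not words:
                empty += 1
            tokens += len(words)
    return {"lines": lines, "empty": empty, "tokens": tokens}
```

For example, `corpus_stats("./data_final/vocabulary.txt.bz2")` would report the counts for the file used in the original command.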

kmario23 commented 5 years ago

but I did not execute this command: bzcat vocabulary.txt.bz2 | python process.py | wc because I did not find the process.py file

Oh, sorry, that was a typo. Fixed it! Do you happen to have the data publicly available? I can try to replicate the error.
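When replicating a failure like this, it can help to run each pipeline stage on its own, with files in between, so that the exit code shows which stage actually fails. The helper below is a rough sketch; `run_stage` is a hypothetical name, and the file names in the usage note are placeholders. On POSIX systems, `subprocess` reports a negative return code when the child was killed by a signal, for example -11 for a segmentation fault.

```python
import subprocess


def run_stage(command, stdin_path, stdout_path):
    """Run one pipeline stage with file-based stdin/stdout and return its exit code."""
    with open(stdin_path, "rb") as src, open(stdout_path, "wb") as dst:
        # stderr is left attached to the terminal so error messages stay visible.
        return subprocess.run(command, stdin=src, stdout=dst).returncode
```

With a decompressed copy of the corpus (placeholder name `vocabulary.txt`), one could call `run_stage(["python", "preprocess.py"], "vocabulary.txt", "tokenized.txt")` and then `run_stage(["/home/innovation/kenlm/bin/lmplz", "-o", "3"], "tokenized.txt", "myvocabulary.arpa")`, checking the return code after each step.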

YasineNifa commented 5 years ago

Here is the data I am using: https://voice.mozilla.org/fr/datasets Thanks for your time :)

YasineNifa commented 5 years ago

If you want the vocabulary.txt, here is a link where you can find it: https://drive.google.com/open?id=1TJH1O5nQsXXO0tLFPRi2zmWUQK_F4wmc

LiqiangJing commented 3 years ago

Hi, did you fix this issue? I am struggling with it now.