taku910 / mecab

Yet another Japanese morphological analyzer

Problems when training. #6

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi, first I'd like to thank taku ku for his awesome MeCab.
I'm training MeCab from scratch to make it analyse Chinese sentences, following 
this website: http://www.onaneet.org/blog/archives/4020, but I'm running into 
some trouble while doing it. 

First, I prepared the files 
- dicrc
- char.def
- unk.def
- rewrite.def
- feature.def
as explained on onaneet.
Then I prepared a training corpus for Chinese and ran mecab-dict-index.
Everything worked perfectly here.
But when running mecab-cost-train, if the training corpus has more than around 
700 sentences, the program stops without printing any error to stderr.

The problem is that 700 sentences is a bit small for a training corpus, isn't it?
And this looks like an unexpected bug...
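
One way to narrow down where the training stops is to cut the corpus down to its first N sentences and retry with different values of N. Below is a minimal Python sketch of that, assuming the corpus uses the usual MeCab training layout in which every sentence ends with a line containing only `EOS`. The function name `take_sentences` and the sample tokens and feature strings are hypothetical.

```python
def take_sentences(lines, n):
    """Return the lines that make up the first n sentences.

    Assumption: each sentence is terminated by a line that is exactly "EOS".
    """
    out = []
    count = 0
    for line in lines:
        if count >= n:
            break
        out.append(line)
        # Strip only the line ending so "EOS" matches with LF or CRLF.
        if line.rstrip("\r\n") == "EOS":
            count += 1
    return out


# Tiny made-up corpus: three sentences, tab-separated surface and feature.
sample = [
    "我\tpronoun\n", "EOS\n",
    "你\tpronoun\n", "好\tadjective\n", "EOS\n",
    "他\tpronoun\n", "EOS\n",
]

first_two = take_sentences(sample, 2)
print(len(first_two))
```

Writing the truncated lines to a new file and training on it with increasing N should show whether the stop depends on the corpus size itself or on one particular sentence.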

I used the Windows version mecab-0.996.exe on Windows Server 2008 R2 Standard 
for x64 processors.

Original issue reported on code.google.com by lacam...@sinequa.com on 18 Jul 2013 at 7:53

GoogleCodeExporter commented 9 years ago
I ran into the same problem when training on a Chinese corpus in UTF-8 encoding, 
but after changing the encoding to EUC-JP it works well.
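
A minimal sketch of that kind of re-encoding in Python, using the standard `utf-8` and `euc_jp` codecs. The function names `unencodable_lines` and `reencode` are hypothetical. Characters with no EUC-JP representation (which may be the case for some simplified Chinese characters) cannot be converted, so the sketch reports the affected lines instead of dropping them silently.

```python
def unencodable_lines(text, target="euc_jp"):
    """Return the 1-based numbers of lines that cannot be encoded in target."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        try:
            line.encode(target)
        except UnicodeEncodeError:
            bad.append(lineno)
    return bad


def reencode(utf8_bytes, target="euc_jp"):
    """Convert UTF-8 bytes to target, refusing if any line is not representable."""
    text = utf8_bytes.decode("utf-8")
    bad = unencodable_lines(text, target)
    if bad:
        raise ValueError(f"lines not representable in {target}: {bad}")
    return text.encode(target)


# Made-up two-line input; real use would read the corpus file in binary mode.
converted = reencode("日本語\nEOS\n".encode("utf-8"))
print(len(converted))
```

The dictionary source files and `dicrc` would presumably need to be in the same encoding as the corpus for this to be consistent.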

Original comment by ling0...@gmail.com on 15 Aug 2013 at 12:32