thunlp / THULAC-Python

An Efficient Lexical Analyzer for Chinese
MIT License
2.02k stars 336 forks

Cannot cut UTF-8 input file, but can cut GBK file #93

Open l1t1 opened 5 years ago

l1t1 commented 5 years ago

D:\Python35-32>python -m thulac inputu.txt output.txt seg_only
Model loaded succeed
Traceback (most recent call last):
  File "D:\Python35-32\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Python35-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Python35-32\lib\site-packages\thulac\__main__.py", line 9, in <module>
    lac.cut_f(sys.argv[1], sys.argv[2])
  File "D:\Python35-32\lib\site-packages\thulac\__init__.py", line 189, in cut_f
    for line in input_f:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xb4 in position 42: illegal multibyte sequence

D:\Python35-32>python -m thulac input.txt output.txt seg_only
Model loaded succeed
successfully cut file input.txt!
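
The 'gbk' codec named in the traceback suggests that `cut_f` opens the input file with the platform's default encoding (GBK on a Chinese-locale Windows install), which would explain why the GBK file works and the UTF-8 one does not. A possible workaround, sketched below, is to read and write the files yourself with an explicit encoding and pass each line to the segmenter. The helper name `cut_file_with_encoding` and its `segment` parameter are hypothetical and not part of THULAC; the commented usage assumes the `thulac.thulac(seg_only=True)` constructor and `cut(text, text=True)` method described in the project's README.

```python
import io


def cut_file_with_encoding(segment, in_path, out_path, encoding="utf-8"):
    """Segment in_path line by line using an explicit file encoding.

    `segment` is any callable that maps a str to a segmented str
    (hypothetical parameter; THULAC itself is not imported here).
    """
    # "utf-8-sig" also accepts files that start with a BOM, which
    # Windows editors often add; other encodings are used unchanged.
    read_encoding = encoding
    if encoding.lower() in ("utf-8", "utf8"):
        read_encoding = "utf-8-sig"

    with io.open(in_path, "r", encoding=read_encoding) as src, \
            io.open(out_path, "w", encoding=encoding) as dst:
        for line in src:
            # Strip the line ending, segment the text, write one line back.
            dst.write(segment(line.rstrip("\r\n")) + "\n")


# Possible usage with THULAC (assumes the README API, not verified here):
#   import thulac
#   lac = thulac.thulac(seg_only=True)
#   cut_file_with_encoding(lambda s: lac.cut(s, text=True),
#                          "inputu.txt", "output.txt")
```

Because the encoding is passed explicitly, the result no longer depends on the console locale, so the same script should behave the same way on GBK and UTF-8 systems.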