lvapeab / nmt-keras

Neural Machine Translation with Keras
http://nmt-keras.readthedocs.io
MIT License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: #103

Closed RookieCXL closed 5 years ago

RookieCXL commented 5 years ago

I trained a model on a different dataset, for Chinese-to-English translation. After training, when I run sample_ensemble.py to translate a Chinese text, I get the following error:

2019-05-17 09:35:00.359418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1409 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:02:00.0, compute capability: 5.0)
[17/05/2019 09:35:00] <<< Loading optimized model... >>>
[17/05/2019 09:35:03] <<< Optimized model loaded. >>>
[17/05/2019 09:35:03] <<< Model loaded in 9.3835 seconds. >>>
[17/05/2019 09:35:03] <<< Loading Dataset instance from datasets\Dataset_ZhEnTrans_zhen.pkl ... >>>
[17/05/2019 09:35:03] <<< Dataset instance loaded >>>
[17/05/2019 09:35:03] Removed "val" set output with id "target_text.
Traceback (most recent call last):
  File "sample_ensemble.py", line 62, in <module>
    sample_ensemble(args, params)
  File "C:\Users\think\Desktop\nmt-keras-master\nmt_keras\apply_model.py", line 41, in sample_ensemble
    dataset = update_dataset_from_file(dataset, args.text, params, splits=args.splits, remove_outputs=True)
  File "C:\Users\think\Desktop\nmt-keras-master\data_engine\prepare_data.py", line 79, in update_dataset_from_file
    overwrite_split=True)
  File "c:\users\think\src\keras-wrapper\keras_wrapper\dataset.py", line 1042, in setInput
    bpe_codes=bpe_codes, separator=separator, use_unk_class=use_unk_class)
  File "c:\users\think\src\keras-wrapper\keras_wrapper\dataset.py", line 1693, in preprocessTextFeatures
    for line in list:
  File "C:\Users\think\AppData\Local\Programs\Python\Python36\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Users\think\AppData\Local\Programs\Python\Python36\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Users\think\AppData\Local\Programs\Python\Python36\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Users\think\AppData\Local\Programs\Python\Python36\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte

lvapeab commented 5 years ago

Hi @RookieCXL,

It seems that the file you're trying to translate is in an encoding other than UTF-8. I suggest converting your file to UTF-8 to avoid these encoding issues.

Feel free to reopen this issue if after converting your file the error persists.
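
For readers hitting the same error, here is a minimal sketch of the suggested conversion. It is not part of nmt-keras or keras-wrapper: the helper names `is_valid_utf8` and `convert_to_utf8` are hypothetical, and the source encoding `gb18030` (a superset of GBK/GB2312, common for Chinese text saved on Windows) is only an assumption. A leading byte 0xd6 is consistent with such an encoding, since for example "中" is stored as the bytes D6 D0 in GBK, but the actual encoding of the file should be checked before converting.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src_path, dst_path, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8.

    src_encoding is an ASSUMPTION about the input file; decoding raises
    UnicodeDecodeError if the bytes are not valid in that encoding.
    Working on raw bytes leaves line endings untouched.
    """
    text = Path(src_path).read_bytes().decode(src_encoding)
    Path(dst_path).write_bytes(text.encode("utf-8"))
    return dst_path
```

A possible usage, with placeholder file names: call `is_valid_utf8("input.zh")` first, and only if it returns False run `convert_to_utf8("input.zh", "input.utf8.zh")` and pass the new file to sample_ensemble.py. Writing to a separate output file keeps the original intact in case the assumed encoding turns out to be wrong.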