gongliym / data2text-transformer

Enhanced Transformer Model for Data-to-Text Generation

UnicodeDecodeError #2

Open KonstantinRothe opened 4 years ago

KonstantinRothe commented 4 years ago
Traceback (most recent call last):
  File "model/preprocess_summary_data.py", line 53, in <module>
    args.summary+".pth", max_len=args.summary_max_length)
  File "C:\Users\user\Python\Data2Text\model\src\data\dictionary.py", line 334, in index_summary
    for i, (summary_line, label_line) in enumerate(zip(summary_inf, summary_label_inf)):
  File "C:\Users\user\miniconda3\envs\pytorchENV\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 7187: invalid start byte

This is the error message I get when running preprocess_summary_data.py on the RotoWire dataset. Everything up to this point worked fine.

KonstantinRothe commented 3 years ago

Found the problem: the dataset contains a few characters that are not valid UTF-8 (such as the byte 0x97 reported in the traceback). After I removed them from the baseline RotoWire set, everything worked.
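
For anyone hitting the same error, here is a minimal sketch of how the offending bytes could be located and cleaned before preprocessing. It is not part of the repository: the helper names `find_invalid_utf8` and `clean_to_utf8` are hypothetical, and the choice of Windows-1252 as the fallback codec is an assumption (0x97 is an em dash in that encoding, which would fit text saved on Windows, but the thread does not confirm where the bytes came from).

```python
def find_invalid_utf8(raw: bytes):
    """Return (line_number, offending_byte) for each line that is not valid UTF-8.

    Only the first invalid byte of each bad line is reported.
    """
    problems = []
    for line_no, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the index of the first byte that failed to decode
            problems.append((line_no, line[err.start]))
    return problems


def clean_to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the data re-encoded as UTF-8.

    Lines that are already valid UTF-8 pass through unchanged. Lines that
    are not are decoded with the fallback codec (an assumption, see above);
    bytes undefined in that codec become U+FFFD. Note that a line mixing
    valid multi-byte UTF-8 with stray bytes is decoded entirely with the
    fallback, so inspect such lines by hand.
    """
    cleaned = []
    for line in raw.split(b"\n"):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            text = line.decode(fallback, errors="replace")
        cleaned.append(text.encode("utf-8"))
    return b"\n".join(cleaned)
```

To use it, read the summary file in binary mode (`open(path, "rb").read()`), call `find_invalid_utf8` to see which lines are affected, and write the result of `clean_to_utf8` back out in binary mode. Deleting the reported characters by hand, as described above, works just as well for a handful of occurrences.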