gitabtion / SoftMaskedBert-PyTorch

🙈 An unofficial implementation of SoftMaskedBert based on huggingface/transformers.
MIT License

'utf-8' codec can't decode byte 0x80 in position 4867: invalid start byte #12

Closed nocoolsandwich closed 3 years ago

nocoolsandwich commented 3 years ago

Hi, I hit this error while preprocessing the data. The same thing happens when running on Linux:

      (tf2) [root@nlp SoftMaskedBert-PyTorch-main]# python main.py --mode preproc  
      preprocessing...
      Traceback (most recent call last):
        File "main.py", line 99, in <module>    
          main()  
        File "main.py", line 63, in main  
          preproc()  
        File "/root/sammy/ForceWord/SoftMaskedBert-PyTorch-main/src/data_processor.py", line 187, in preproc  
          for item in read_data(get_abs_path('data')):  
        File "/root/sammy/ForceWord/SoftMaskedBert-PyTorch-main/src/data_processor.py", line 117, in read_data  
          for line in f:  
        File "/root/anaconda3/envs/tf2/lib/python3.7/codecs.py", line 322, in decode  
          (result, consumed) = self._buffer_decode(data, self.errors, final)  
      UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 4867: invalid start byte  
nocoolsandwich commented 3 years ago

Solved: pass `'r', errors='ignore'` to the `open` call, and wrap the other failing spots in `try`.
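For anyone landing here later, a minimal sketch of that change. The helper name `read_lines_lenient` and the demo file are made up for illustration; the real loop lives in `read_data` in `src/data_processor.py`, whose exact code is not shown in this thread.

```python
import os
import tempfile


def read_lines_lenient(path, encoding="utf-8"):
    """Yield lines from `path`, dropping any bytes that are not valid in `encoding`.

    Hypothetical helper, not the repository's actual read_data().
    """
    # errors="ignore" makes the decoder skip undecodable bytes
    # instead of raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, errors="ignore") as f:
        for line in f:
            yield line.rstrip("\n")


# Demo: build a small file that contains a stray 0x80 byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("good line\n".encode("utf-8"))
    tmp.write(b"bad \x80 byte\n")
    demo_path = tmp.name

lines = list(read_lines_lenient(demo_path))
os.remove(demo_path)
print(lines)
```

Note that `errors='ignore'` drops the offending bytes silently, so the affected lines come out slightly shorter than in the file. `errors='replace'` keeps a visible U+FFFD marker instead, which makes damaged lines easier to spot.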

so-coolboy commented 3 years ago

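The encoding error can be fixed with `errors='ignore'`. After that you get a `KeyError`, because some ids in the data have no matching entry. Look up the mismatched entries in the corresponding files and correct their ids.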
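A rough sketch of how the mismatched ids could be located before editing the files. The thread does not show the data layout, so this assumes the two sides have already been loaded into dicts keyed by id; `find_unmatched_ids`, `inputs`, `targets`, and the sample ids are all hypothetical.

```python
def find_unmatched_ids(inputs, targets):
    """Return (ids only in inputs, ids only in targets), each sorted.

    Hypothetical helper: both arguments are assumed to be dicts keyed by id.
    """
    input_ids, target_ids = set(inputs), set(targets)
    return sorted(input_ids - target_ids), sorted(target_ids - input_ids)


# Made-up sample data; real ids and sentences come from the dataset files.
inputs = {"A001": "sentence 1", "A002": "sentence 2", "A003": "sentence 3"}
targets = {"A001": "label 1", "A003": "label 3", "A004": "label 4"}

missing_targets, missing_inputs = find_unmatched_ids(inputs, targets)
print("ids with no target:", missing_targets)
print("ids with no input:", missing_inputs)
```

Any id reported here would raise `KeyError` on a direct lookup, so these are the entries to fix in the source files.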