utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

BertLMDataBunch.from_raw_corpus: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 49: invalid continuation byte #238

Closed NawelAr closed 4 years ago

NawelAr commented 4 years ago

Hello, I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 49: invalid continuation byte when creating a LMDataBunch from a raw corpus. Does anyone know how to fix this? Thanks,


UnicodeDecodeError                        Traceback (most recent call last)
in
      7     multi_gpu=False,
      8     model_type='camembert-base',
----> 9     logger=logger)

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
    198             logger=logger,
    199             clear_cache=clear_cache,
--> 200             no_cache=no_cache,
    201         )
    202

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
    270             cached_features_file,
    271             self.logger,
--> 272             block_size=self.tokenizer.max_len_single_sentence,
    273         )
    274

~\anaconda3\lib\site-packages\fast_bert\data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
    131         self.examples = []
    132         with open(file_path, encoding="utf-8") as f:
--> 133             text = f.read()
    134
    135         tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

~\anaconda3\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 49: invalid continuation byte

NawelAr commented 4 years ago

Solved: I just had to re-save my data files with UTF-8 encoding.
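
For anyone hitting the same error, here is a minimal sketch of one way to do that re-encoding in Python. The traceback shows that fast_bert opens the corpus with encoding="utf-8", so the file on disk has to be valid UTF-8. The sketch assumes the original files were saved as Latin-1 (ISO-8859-1), where byte 0xe9 is "é"; the thread does not say which encoding was actually used, so adjust source_encoding (for example to "cp1252") to match your data. The helper name reencode_to_utf8 and the file paths are hypothetical and are not part of fast_bert.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="latin-1"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    source_encoding="latin-1" is an assumption about how the file was saved;
    it is not confirmed by the thread. Returns the decoded text.
    """
    # Decode the raw bytes with the encoding the file was really saved in.
    text = Path(src).read_text(encoding=source_encoding)
    # Write the same characters back out as UTF-8, which is what
    # open(file_path, encoding="utf-8") in data_lm.py expects.
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

Calling reencode_to_utf8("corpus_latin1.txt", "corpus_utf8.txt") (hypothetical file names) on each data file before building the data bunch should give files that can be read as UTF-8. Note that Latin-1 can decode any byte sequence, so a wrong guess will not raise an error; it is worth checking a few accented words in the output.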