Open kujaomega opened 4 years ago
When I instantiate the BertDataBunch class with the following parameters:
BertDataBunch('./', './', tokenizer='bert-base-uncased', train_file='train.csv', val_file='val.csv', label_file='labels.csv', text_col='text', label_col=[ "POSITIVE", "NEGATIVE" ], batch_size_per_gpu=16, max_seq_length=512, multi_gpu=False, multi_label=True, model_type='bert')
I'm getting the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input-4-8957d7fa0907> in <module>
     43     max_seq_length=512,
     44     multi_gpu=False,
---> 45     multi_label=True, model_type='bert')

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
    453
    454         train_dataset = self.get_dataset_from_examples(
--> 455             train_examples, "train", no_cache=self.no_cache
    456         )
    457

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in get_dataset_from_examples(self, examples, set_type, is_test, no_cache)
    580                 pad_on_left=bool(self.model_type in ["xlnet"]),
    581                 pad_token_segment_id=4 if self.model_type in ["xlnet"] else 0,
--> 582                 logger=self.logger,
    583             )
    584

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
    133             logger.info("Writing example %d of %d" % (ex_index, len(examples)))
    134
--> 135         tokens_a = tokenizer.tokenize(example.text_a)
    136
    137         tokens_b = None

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    674
    675         if self.init_kwargs.get("do_lower_case", False):
--> 676             text = lowercase_text(text)
    677
    678         def split_on_token(tok, text):

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in lowercase_text(t)
    671             escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]
    672             pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
--> 673             return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)
    674
    675         if self.init_kwargs.get("do_lower_case", False):

/usr/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object
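Maybe there are some NaN cells in your text column. pandas loads an empty CSV cell as a float NaN, and when the tokenizer's lowercasing step hands that float to re.sub, you would get exactly this TypeError.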
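If that is the cause, a quick way to confirm it is to look for rows whose text cell is not a string. Below is a minimal sketch using only the standard library. The sample rows and the helper name `split_rows` are made up for illustration and are not part of fast_bert; the rows stand in for whatever train.csv actually contains.

```python
import re


def split_rows(rows, text_col="text"):
    """Separate rows whose text cell is a real string from rows where it is not.

    A float NaN (what an empty CSV cell typically becomes) lands in `bad`.
    """
    good, bad = [], []
    for index, row in enumerate(rows):
        target = good if isinstance(row.get(text_col), str) else bad
        target.append((index, row))
    return good, bad


# Hypothetical sample rows standing in for the contents of train.csv.
rows = [
    {"text": "great product", "POSITIVE": 1, "NEGATIVE": 0},
    {"text": float("nan"), "POSITIVE": 0, "NEGATIVE": 1},  # empty cell
    {"text": "would not buy again", "POSITIVE": 0, "NEGATIVE": 1},
]

# Reproduce the failure from the traceback: re.sub rejects a float argument.
try:
    re.sub(r"(.+?)", lambda m: m.group(1).lower(), rows[1]["text"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

good, bad = split_rows(rows)
bad_indices = [index for index, _ in bad]  # rows to inspect or drop
clean_rows = [row for _, row in good]      # rows that are safe to tokenize

print(error_message)
print(bad_indices)
```

If the CSV is already loaded in a pandas DataFrame, `df["text"].isna().sum()` should report how many cells are affected, and `df.dropna(subset=["text"])` drops them before the cleaned file is written back out for BertDataBunch.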
Did you solve this? I have the same problem.