BertDataBunch problem: TypeError: expected string or bytes-like object

kujaomega commented 4 years ago

When I instantiate the BertDataBunch class with the following parameters:

BertDataBunch('./', './',
              tokenizer='bert-base-uncased',
              train_file='train.csv',
              val_file='val.csv',
              label_file='labels.csv',
              text_col='text',
              label_col=[
                  "POSITIVE",
                  "NEGATIVE"
              ],
              batch_size_per_gpu=16,
              max_seq_length=512,
              multi_gpu=False,
              multi_label=True, model_type='bert')

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-8957d7fa0907> in <module>
     43                           max_seq_length=512,
     44                           multi_gpu=False,
---> 45                           multi_label=True, model_type='bert')

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
    453 
    454             train_dataset = self.get_dataset_from_examples(
--> 455                 train_examples, "train", no_cache=self.no_cache
    456             )
    457 

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in get_dataset_from_examples(self, examples, set_type, is_test, no_cache)
    580                 pad_on_left=bool(self.model_type in ["xlnet"]),
    581                 pad_token_segment_id=4 if self.model_type in ["xlnet"] else 0,
--> 582                 logger=self.logger,
    583             )
    584 

/usr/local/lib/python3.6/dist-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
    133                 logger.info("Writing example %d of %d" % (ex_index, len(examples)))
    134 
--> 135         tokens_a = tokenizer.tokenize(example.text_a)
    136 
    137         tokens_b = None

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    674 
    675         if self.init_kwargs.get("do_lower_case", False):
--> 676             text = lowercase_text(text)
    677 
    678         def split_on_token(tok, text):

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in lowercase_text(t)
    671             escaped_special_toks = [re.escape(s_tok) for s_tok in all_special_tokens]
    672             pattern = r"(" + r"|".join(escaped_special_toks) + r")|" + r"(.+?)"
--> 673             return re.sub(pattern, lambda m: m.groups()[0] or m.groups()[1].lower(), t)
    674 
    675         if self.init_kwargs.get("do_lower_case", False):

/usr/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

siyanew commented 4 years ago

Maybe there are some NaN cells in your text column.
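
One way to check this, as a rough sketch rather than a confirmed fix: load the CSV with pandas and look for missing values in the text column. pandas parses an empty cell as a float NaN, and a float passed to tokenizer.tokenize would end up in re.sub, which only accepts str or bytes. The in-memory CSV below is a hypothetical stand-in for train.csv; the column names are taken from the call above, and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv: the second data row has an empty text cell.
csv_data = io.StringIO(
    "text,POSITIVE,NEGATIVE\n"
    "great movie,1,0\n"
    ",0,1\n"
    "terrible plot,0,1\n"
)
df = pd.read_csv(csv_data)

# pandas reads the empty cell as NaN (a float), not as a str.
n_missing = int(df["text"].isna().sum())
print("rows with missing text:", n_missing)

# Drop the rows with missing text and make sure the remaining values are str.
clean = df.dropna(subset=["text"]).copy()
clean["text"] = clean["text"].astype(str)
print("rows kept:", len(clean))
```

If the count is non-zero for the real train.csv or val.csv, writing the cleaned frame back out with clean.to_csv(..., index=False) and pointing train_file / val_file at the cleaned files may be enough. Filling the gaps with df["text"].fillna("") instead of dropping rows is another option, if empty texts are acceptable for training.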

liw084 commented 4 years ago

Did you solve this? I have the same problem.

utterworks / fast-bert

BertDataBunch problem: TypeError: expected string or bytes-like object #171