utterworks / fast-bert

Super easy library for BERT based NLP models

Databunch on AWS - AttributeError: 'float' object has no attribute 'split' #78

vinayak-MLAI opened this issue 5 years ago (status: Open)

I am getting this error as soon as the log prints INFO:root:Writing example 0 of 9067886.

I am running the standard code on AWS SageMaker with PyTorch. Both the error stack and the code used are pasted below. I have no idea what is causing this; any help would be greatly appreciated!

**Error**

```
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/ec2-user/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:root:Writing example 0 of 9067886

AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
     16                           multi_gpu=True,
     17                           multi_label=False,
---> 18                           model_type='bert')

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
    347
    348         train_dataset = self.get_dataset_from_examples(
--> 349             train_examples, 'train')
    350
    351         self.train_batch_size = self.batch_size_per_gpu * \

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in get_dataset_from_examples(self, examples, set_type, is_test, no_cache)
    443             pad_on_left=bool(self.model_type in ['xlnet']),
    444             pad_token_segment_id=4 if self.model_type in ['xlnet'] else 0,
--> 445             logger=self.logger)
    446
    447         # Create folder if it doesn't exist

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
    105                         (ex_index, len(examples)))
    106
--> 107         tokens_a = tokenizer.tokenize(example.text_a)
    108
    109         tokens_b = None

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
    651
    652         added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens
--> 653         tokenized_text = split_on_tokens(added_tokens, text)
    654         return tokenized_text
    655

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
    641                 if sub_text not in self.added_tokens_encoder \
    642                    and sub_text not in self.all_special_tokens:
--> 643                     tokenized_text += split_on_token(tok, sub_text)
    644                 else:
    645                     tokenized_text += [sub_text]

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in split_on_token(tok, text)
    612     def split_on_token(tok, text):
    613         result = []
--> 614         split_text = text.split(tok)
    615         for i, sub_text in enumerate(split_text):
    616             sub_text = sub_text.strip()

AttributeError: 'float' object has no attribute 'split'
```

**Code**

```python
from fast_bert.data_cls import BertDataBunch

DATA_PATH = '/home/ec2-user/SageMaker'
LABEL_PATH = '/home/ec2-user/SageMaker'

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')
```
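(Editor's note: the `'float' object` in the traceback is almost certainly a NaN. pandas parses empty CSV cells as `float('nan')`, so `tokenizer.tokenize(example.text_a)` receives a float instead of a string. A minimal diagnostic sketch, assuming pandas and the `train.csv`/`text` names from the code above:)

```python
import pandas as pd

# Load the same CSV the databunch reads and count missing text cells.
# Empty cells come back as NaN (a float), which is what the tokenizer
# later chokes on when it calls text.split(tok).
df = pd.read_csv('/home/ec2-user/SageMaker/train.csv')
print(df['text'].isna().sum())        # number of NaN rows in the text column
print(df[df['text'].isna()].head())   # inspect a few offending rows
```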
aaranda7 commented 5 years ago

I am having the same issue

Saphirox commented 5 years ago

You need to clean the NaN values out of your sentences; that fixed it for me. A minimal cleanup sketch is below (assuming pandas and the file/column names from the original post).
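```python
import pandas as pd

# Drop rows with a missing text or label before building the databunch;
# NaN values are floats and crash tokenizer.tokenize().
for name in ('train.csv', 'val.csv'):
    path = f'/home/ec2-user/SageMaker/{name}'
    df = pd.read_csv(path)
    df = df.dropna(subset=['text', 'label'])
    df.to_csv(path, index=False)
```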

enzoampil commented 5 years ago

@Saphirox what do you mean exactly by this?

enzoampil commented 5 years ago

Never mind, got it. Just dropped the NaNs.