I am getting the error below as soon as this line is logged:
INFO:root:Writing example 0 of 9067886
I am running the standard code on AWS SageMaker with PyTorch. Both the error stack and the code used are pasted below. I have no idea what is causing this; any help would be greatly appreciated!
**Error**
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /home/ec2-user/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:root:Writing example 0 of 9067886
AttributeError Traceback (most recent call last)
<ipython-input-…> in <module>()
16 multi_gpu=True,
17 multi_label=False,
---> 18 model_type='bert')
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
347
348 train_dataset = self.get_dataset_from_examples(
--> 349 train_examples, 'train')
350
351 self.train_batch_size = self.batch_size_per_gpu * \
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in get_dataset_from_examples(self, examples, set_type, is_test, no_cache)
443 pad_on_left=bool(self.model_type in ['xlnet']),
444 pad_token_segment_id=4 if self.model_type in ['xlnet'] else 0,
--> 445 logger=self.logger)
446
447 # Create folder if it doesn't exist
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
105 (ex_index, len(examples)))
106
--> 107 tokens_a = tokenizer.tokenize(example.text_a)
108
109 tokens_b = None
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
651
652 added_tokens = list(self.added_tokens_encoder.keys()) + self.all_special_tokens
--> 653 tokenized_text = split_on_tokens(added_tokens, text)
654 return tokenized_text
655
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
641 if sub_text not in self.added_tokens_encoder \
642 and sub_text not in self.all_special_tokens:
--> 643 tokenized_text += split_on_token(tok, sub_text)
644 else:
645 tokenized_text += [sub_text]
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/transformers/tokenization_utils.py in split_on_token(tok, text)
612 def split_on_token(tok, text):
613 result = []
--> 614 split_text = text.split(tok)
615 for i, sub_text in enumerate(split_text):
616 sub_text = sub_text.strip()
AttributeError: 'float' object has no attribute 'split'
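Reading the bottom of the trace, the failure happens inside tokenizer.tokenize(example.text_a) when text.split(tok) is called, so it looks like at least one value in my text column is a float rather than a string (pandas reads empty CSV cells as NaN, which is a float). A minimal check of that assumption, using the train.csv path and text column name from the code below:

import pandas as pd

# Assumption: train.csv is the file passed to BertDataBunch below
# and 'text' is its text column.
df = pd.read_csv('/home/ec2-user/SageMaker/train.csv')

# NaN (a float) in the text column would explain the AttributeError.
print(df['text'].isna().sum())                                # missing cells
print((~df['text'].map(lambda v: isinstance(v, str))).sum())  # non-string cells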
**Code**
from fast_bert.data_cls import BertDataBunch

DATA_PATH = '/home/ec2-user/SageMaker'
LABEL_PATH = '/home/ec2-user/SageMaker'

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')
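If that check turns up missing or non-string rows, a minimal cleanup sketch (assuming the NaN hypothesis above) is to scrub the CSVs before constructing the DataBunch:

import pandas as pd

# Assumption: empty text cells are the source of the float values.
for name in ['train.csv', 'val.csv']:
    path = '/home/ec2-user/SageMaker/' + name
    df = pd.read_csv(path)
    df = df.dropna(subset=['text'])      # drop rows with no text
    df['text'] = df['text'].astype(str)  # coerce any remaining non-strings
    df.to_csv(path, index=False)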