utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

Multilabel ValueError: could not convert string to float #235

Closed ndolev closed 4 years ago

ndolev commented 4 years ago

I am attempting to create a BertDataBunch for multilabel classification exactly as in the README. I provide a list of label columns, but data_cls.py seems to expect the label values to be floats instead of strings. Any ideas?

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='bert_train_set.csv',
                          val_file='bert_val_set.csv',
                          label_file='bert_labels.csv',
                          text_col='text',
                          label_col=['label1','label2','label3'],
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=True,
                          model_type='bert')

~/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end, pad_on_left, cls_token, sep_token, pad_token, sequence_a_segment_id, sequence_b_segment_id, cls_token_segment_id, pad_token_segment_id, mask_padding_with_zero, logger)
    174             label_id = []
    175             for label in example.label:
--> 176                 label_id.append(float(label))
    177         else:
    178             if example.label is not None:

And my bert_labels.csv looks like:

label1
label2
label3

And bert_train_set like:

index,text,label1,label2,label3

ndolev commented 4 years ago

The error message made it hard to diagnose, but the problem was on my side: a string snuck into my one-hot encoded multi-label data set. :)
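
For anyone hitting the same ValueError: since the traceback shows float(label) being called on every label value, one way to track down the offending cell is to scan the label columns before building the databunch. Below is a minimal sketch using only the standard library. The helper name find_non_numeric_labels and the sample rows are made up for illustration and are not part of fast-bert; only the header matches the one posted above.

```python
import csv
import io


def find_non_numeric_labels(csv_file, label_cols):
    """Return (row_number, column, value) for each label cell that float() rejects."""
    bad = []
    reader = csv.DictReader(csv_file)
    # Rows are counted with the header as row 1, so data starts at row 2.
    for row_number, row in enumerate(reader, start=2):
        for col in label_cols:
            value = row[col]
            try:
                float(value)
            except (TypeError, ValueError):
                # TypeError covers a missing cell (None), ValueError a non-numeric string.
                bad.append((row_number, col, value))
    return bad


# Hypothetical sample data; the header mirrors bert_train_set.csv from this issue.
sample = io.StringIO(
    "index,text,label1,label2,label3\n"
    "0,first example,1,0,0\n"
    "1,second example,0,yes,1\n"
)
bad_cells = find_non_numeric_labels(sample, ["label1", "label2", "label3"])
print(bad_cells)  # [(3, 'label2', 'yes')]
```

To check a real file, pass an open file handle instead of the StringIO object, for example open('bert_train_set.csv', newline=''). An empty result means every label cell parses as a float.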